Bloom levels: K = Know, C = Comprehend, A = Apply; N = not covered at that level.

| Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (Core) | Advanced Bloom Level | Learning Outcome (Advanced) | Where Covered |
|---|---|---|---|---|---|
| **Classes of Parallelism** | | | | | |
| Taxonomy | C | Flynn's taxonomy, data vs. control parallelism, shared/distributed memory | C | Arch2: the core outcome, at additional depth | Systems; Arch2 |
| **Data parallelism** | | | | | |
| Superscalar (ILP) | K | Describe opportunities for multiple instruction issue and execution (different instructions on different data) | A | Explain how superscalar works and how to schedule instructions for wider issue | Systems; Arch2 |
| SIMD/Vector (e.g., AVX, GPU, TPU) | K | Describe uses of SIMD/Vector (same operation on multiple data items), e.g., accelerating graphics for games | C | Know the relationship of vector extensions to separate accelerators, and how they operate | Systems; Arch2 |
| SIMD/Vector energy effects | K | Saves energy by sharing one instruction over many instruction executions, whether in parallel (SIMD) or pipelined (vector) | K | Arch2: the core outcome, at additional depth | Systems; Arch2 |
| Dataflow | N | | K | Be aware of this alternative execution paradigm | Arch2 |
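The energy argument in the SIMD/Vector rows comes down to amortization: one fetched and decoded instruction drives many lane executions. A minimal model (the cost unit is hypothetical; only the ratio matters):

```python
# Front-end (fetch/decode) cost model for SIMD of width W: processing n
# elements takes ceil(n / W) instructions instead of n.
def frontend_cost(n_elements, width, cost_per_instruction=1.0):
    instructions = -(-n_elements // width)  # ceiling division
    return instructions * cost_per_instruction

scalar = frontend_cost(1024, width=1)  # one instruction per element
simd8 = frontend_cost(1024, width=8)   # one instruction per 8 elements

print(scalar / simd8)  # -> 8.0: fetch/decode energy amortized over 8 lanes
```

The same ratio is why pipelined vector units also save energy: the instruction is issued once and reused across all of the vector's elements.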
| Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (Core) | Advanced Bloom Level | Learning Outcome (Advanced) | Where Covered |
|---|---|---|---|---|---|
| **Pipelines** | | | | | |
| Basic structure and organization | C | Describe the basic pipelining process (multiple instructions can execute at the same time); describe the stages of instruction execution | C | Arch2: Describe basic and more advanced pipelining, including superpipelining and superscalar issue and commit (e.g., scoreboard, reservation stations, reorder buffer); describe the stages of instruction execution at additional depth, such as forwarding, multithread handling, and shelving | Systems; Arch2 |
| Data and control hazards | K | Show examples of how one pipe stage can depend on a result from another, or how delayed branch resolution can start the wrong instructions in a pipe, reducing performance | C | Compilers: Instruction scheduling for modern pipelined systems. Arch2: Pipeline effects including stalling, shelving, and restarting; mechanisms for avoiding performance losses, such as branch prediction, predication, forwarding, and multithreading. DistSystems: Understand equivalent ideas as expressions of time and of local versus global knowledge | Systems; Compilers; Arch2; DistSystems |
| OoO execution, speculation | N | | C | Be able to work through examples of scoreboards, reservation stations, and reorder buffers; the difference between complete and commit; and speculation (including security issues) | Arch2 |
| Streams (e.g., GPU) | K | Know that stream-based architecture exists in GPUs | C | Be able to describe basic GPU architecture with streaming multiprocessors, threads, warps, and the memory hierarchy | Systems; Arch2 |
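The pipelining rows can be backed by the standard cycle-count formula: an ideal k-stage pipeline finishes n instructions in k + n - 1 cycles, and each hazard stall adds cycles on top. A small sketch (the stall counts are illustrative):

```python
def pipeline_cycles(n_instructions, stages, stall_cycles=0):
    # k cycles to fill the pipe for the first instruction, then one
    # instruction completes per cycle, plus any stalls from hazards.
    return stages + n_instructions - 1 + stall_cycles

ideal = pipeline_cycles(100, stages=5)                     # 104 cycles
hazards = pipeline_cycles(100, stages=5, stall_cycles=20)  # 124 cycles

print(ideal, 100 / ideal)      # IPC approaches 1 as n grows
print(hazards, 100 / hazards)  # stalls pull IPC below 1
```

This is exactly why the advanced outcomes emphasize branch prediction and forwarding: they exist to keep the stall term near zero.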
| Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (Core) | Advanced Bloom Level | Learning Outcome (Advanced) | Where Covered |
|---|---|---|---|---|---|
| **Control parallelism** | | | | | |
| MIMD | K | Identify MIMD instances in practice (e.g., multicore, cluster), and know the difference between execution of tasks and threads | | | Systems |
| Multi-Threading | K | Distinguish multithreading from multicore (based on which resources are shared) | C | Be able to explain the differences between coarse-grain, fine-grain, and simultaneous multithreading | Systems; Arch2 |
| Scaling of Multithreading (e.g., GPU, IBM Power series) | N | | K | Have an awareness of the potential and limitations of thread-level parallelism in different kinds of applications | Arch2 |
| Multicore | C | Describe how cores share resources (cache, memory) and resolve conflicts | C | Be able to describe the relationship of cores to the memory hierarchy, and the reason for switching from increasing clock rate to increasing core count | Systems; Arch2 |
| Heterogeneous (e.g., GPU, security, AI, low-power cores) | K | Recognize that the cores in a multicore may not all be the same kind (mix of organizations, instruction sets; high-level explanation of benefits and costs) | C | Be able to describe performance benefits from integrating heterogeneous cores on a chip, in the context of applications with heterogeneous parallelism. Be able to cite reasons why some high-performance accelerators are on separate chips, due to the need for greater silicon real estate | Systems; Arch2 |
| Heterogeneous architecture energy (small vs. big cores, CPU vs. GPU, accelerators) | K | Know that heterogeneity saves energy by using power-efficient cores when there is sufficient parallelism | C | Enumerate energy benefits from heterogeneity in architecture. Illustrate the power/performance tradeoff. Discuss some combination of: high variability with small feature sizes, reliability issues with lower voltages, redundancy techniques to enhance reliability, and performance/reliability and power/reliability tradeoffs | Systems; Arch2 |
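MIMD, as described above, means independent instruction streams operating on independent data. In a high-level language that is simply concurrent tasks; a sketch using Python's standard thread pool (the two task functions are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Two different "instruction streams" applied to two different data items,
# running concurrently: MIMD in miniature.
def word_count(text):
    return len(text.split())

def checksum(data):
    return sum(data) % 256

with ThreadPoolExecutor(max_workers=2) as pool:
    words = pool.submit(word_count, "multiple instruction multiple data")
    check = pool.submit(checksum, bytes(range(10)))
    print(words.result(), check.result())  # -> 4 45
```

Submitting heterogeneous functions like this is a task; running the same function over partitions of one dataset would instead be data parallelism expressed with threads.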
| Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (Core) | Advanced Bloom Level | Learning Outcome (Advanced) | Where Covered |
|---|---|---|---|---|---|
| **Shared memory** | | | | | |
| Symmetric Multi-Processor (SMP) | K | Explain the concept of a uniform-access shared memory architecture | | | Systems |
| Buses for shared-memory implementation | C | Be able to explain issues related to shared memory being a single resource: limited bandwidth and latency, snooping, and scalability | C | Arch2: Describe different arbitration schemes and tradeoffs. DistSystems: Explain the need for symmetry breaking | Systems; Arch2; DistSystems |
| Broadcast (snooping)-based Cache-Coherent Non-Uniform Memory Access (CC-NUMA) | N | | K | Be aware that caches in the context of shared memory depend on coherence protocols, and understand the idea of snooping (protocols are addressed in a separate topic) | Arch2 |
| Directory-based CC-NUMA | N | | C | Arch2: Be aware that bus-based sharing doesn't scale, and that directories offer an alternative based on distributed memory hardware. DistSystems: See how shared memory can be extended to loosely coupled systems with an appropriate consistency model | Arch2; DistSystems |
| Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (Core) | Advanced Bloom Level | Learning Outcome (Advanced) | Where Covered |
|---|---|---|---|---|---|
| **Distributed memory** | | | | | |
| Message passing (no shared memory) | N | | K | Arch2: Shared memory architecture breaks down when scaled, due to physical limitations (latency, bandwidth), which leads to message-passing architectures. ParProg: Be aware of the effect of network hardware support in the context of, e.g., MPI and PGAS implementations | Arch2; ParProg |
| Topologies | N | | C | Algo2: Various graph topologies (linear, ring, mesh/torus, tree, hypercube, clique, crossbar) and their properties (e.g., diameter, bisection width). DistSystems: Understand the network as the means of communication among distributed nodes, and the effects of topology | Algo2; ParProg; DistSystems |
| Latency | K | Know the concept, its implications for scaling, and its impact on the work/communication ratio needed to achieve speedup. Awareness of the aspects of latency in parallel systems and networks that contribute to the idea of time accumulating over dependent parts or distance | C | Algo2: Give examples of how communication latency contributes to the complexity analysis of an algorithm. DistSystems: Show awareness of hardware aspects of latency in parallel systems and networks that contribute to the concepts of distributed time and of local versus global knowledge; differences between synchronous and asynchronous systems | Systems; Algo2; DistSystems |
| Bandwidth | K | Know the concept, how it limits sharing, and considerations of data movement cost | | | Systems |
| Circuit and packet switching | N | | C | Arch2: Know that interprocessor communication can be managed using switches in networks of wires to establish different point-to-point connections, that the topology of the network affects efficiency, and that some connections may block others. Know that interprocessor communications can be broken into packets that are redirected at switch nodes in a network, based on header info. Networking: Be able to explain basic functionality and limitations within routers | Arch2; Networking |
| Routing | N | | C | Arch2: Know that messages in a network must follow an algorithm that ensures progress toward their destinations, and be familiar with common techniques such as store-and-forward or wormhole routing. Networking: Be able to explain wide-area routing, packet latency, and fault tolerance. DistSystems: See the impact on distributed systems; see adaptive routing as an instance of load balancing | Arch2; Networking; DistSystems |
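The latency and bandwidth rows combine in the usual first-order cost model for sending n bytes, T(n) = alpha + n/beta, where alpha is per-message latency and beta is bandwidth. A sketch with hypothetical network numbers:

```python
# First-order (alpha-beta) message cost model; parameters are made up.
def transfer_time(n_bytes, latency_s=2e-6, bandwidth_bps=10e9):
    return latency_s + n_bytes / bandwidth_bps

small = transfer_time(8)      # 8-byte message: latency-dominated
large = transfer_time(100e6)  # 100 MB message: bandwidth-dominated

# The fixed alpha term is why many small messages cost far more than one
# large message carrying the same total data.
print(small, large)
print(1000 * transfer_time(100) > transfer_time(100 * 1000))  # -> True
```

This model also motivates the work/communication ratio mentioned in the Latency row: speedup requires enough computation per byte moved to hide both terms.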
| Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (Core) | Advanced Bloom Level | Learning Outcome (Advanced) | Where Covered |
|---|---|---|---|---|---|
| **Underlying Mechanisms** | | | | | |
| Caching | C | Know the cache hierarchies, and that shared caches (vs. private caches) result in coherency and performance issues for software | C | Arch2: Explain why coherence is complicated by synonym resolution within a virtual memory system. Know that TLBs are affected by sharing in a multicore context. Be aware of emerging non-volatile memory technology in the hierarchy. DistSystems: Show how network latency can be hidden by caching, but that caching also introduces coherence issues | Systems; Arch2; DistSystems |
| Atomicity | K | CS2: Show awareness of the significance of thread-safe data structures in libraries. Systems: Explain the need for atomic operations and their use in synchronization, in both local and distributed contexts | C | Arch2: Understand implementation mechanisms for indivisible memory access operations in a multicore system. OS: Be able to use atomic operations in the management of critical sections locally, in a multiprocessor. DistSystems: Be able to recognize or give examples of atomicity in a distributed system | CS2; Systems; Arch2; OS; DistSystems |
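The atomicity row can be demonstrated in any threaded language: an unprotected read-modify-write is not indivisible, so concurrent updates can be lost; a lock (backed by an atomic hardware primitive underneath) restores indivisibility. A minimal sketch using Python's standard threading module:

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:  # critical section: load, increment, store as one unit
            counter += 1

threads = [threading.Thread(target=add, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # -> 400000: with the lock, no increment is lost
```

Hardware supplies the primitive beneath the lock, e.g., compare-and-swap or load-linked/store-conditional. Removing the `with lock:` line makes lost updates possible, because `counter += 1` is separate load, add, and store steps that threads can interleave.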
| Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (Core) | Advanced Bloom Level | Learning Outcome (Advanced) | Where Covered |
|---|---|---|---|---|---|
| Consistency | N | | C | Arch2: Be able to explain implementation mechanisms for consistent views of shared data. OS: Be able to explain consistency from a process-management perspective. ParProg: Be able to explain consistency from the perspective of language or library models. DistSystems: Be able to explain consistency from the perspective of coping with high latency | Arch2; OS; ParProg; DistSystems |
| Coherence | N | | C | Arch2: Examine protocols and their implementation and performance differences. ParProg, OS: Describe how cores share cache and resolve conflicts. DistSystems: Protocols for wide-area coherence | Arch2; OS; ParProg; DistSystems |
| ● False sharing | N | | C | Arch2: Explain in the context of excess coherence traffic. ParProg: Learn how to avoid it by structuring data and orchestrating access. DistSystems: Explain how it can occur with remote data | Arch2; ParProg; DistSystems |
| Interrupts and event handling | K | CS2: Know that event handling is an aspect of graphical user interfaces. Systems: Know that I/O is mostly interrupt-driven | C | Arch2: Know how interrupts are implemented, including the vector table, masking, and timers used in multiprocessing; be able to explain handshaking protocols. OS: Be able to work with interrupt handling, including interrupts from other processors and network sources, and the use of timer interrupts in multiprocessors. DistSystems: Be able to use network interrupts in a distributed system, and understand handshaking protocols for communications | CS2; Systems; Arch2; OS; DistSystems |
| Handshaking | K | Know the difference between synchronous and asynchronous communication protocols, including in the context of multiple processors | C | Arch2: Be able to explain handshaking protocols for buses. Networking: Be able to explain handshaking in network protocols. DistSystems: Be able to use handshaking protocols for communication | Systems; Arch2; Networking; DistSystems |
| Process ID, System/Node ID | K | Explain the need for a process identifier to enable interprocess communication, and for a node identifier in the initialization of multiprocessor systems | C | Arch2: Know how the process ID is implemented; see the need for a node ID for symmetry breaking. ParProg: Be able to use the process ID for communication, and the node ID for symmetry breaking, in programs. DistSystems: Know how the node ID is a key element of local knowledge | Systems; Arch2; ParProg; DistSystems |
| Virtualization Support | N | | C | Arch2, OS, DistSystems: Enumerate the pros and cons of virtualization for security, load balancing, and quality-of-service management, and/or the issues of supporting it with good performance and security | Arch2; OS; DistSystems |
| Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (Core) | Advanced Bloom Level | Learning Outcome (Advanced) | Where Covered |
|---|---|---|---|---|---|
| **Floating Point Representation** | | These topics are already supposed to be in the ACM/IEEE core curriculum; they are included here to emphasize their importance, especially in the context of PDC for large problems | | | |
| Range | K | Explain why range is limited, and the implications of infinities | | | CS1; CS2; Systems |
| Precision | K | Explain how single- and double-precision floating point numbers impact software performance, and basic ways to avoid loss of precision in calculations | | | CS1; CS2; Systems |
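The precision row is easy to demonstrate: doubles carry about 16 significant decimal digits, so adding a small value to a much larger one can lose it entirely, and decimal fractions such as 0.1 are not exact in binary. For example:

```python
# Absorption: at magnitude 1e16 the gap between adjacent doubles is 2.0,
# so adding 1.0 changes nothing.
big = 1e16
print((big + 1.0) - big)  # -> 0.0

# Representation error accumulates: ten copies of 0.1 do not sum to 1.0.
total = sum([0.1] * 10)
print(total == 1.0)             # -> False
print(abs(total - 1.0) < 1e-9)  # -> True: compare with a tolerance instead
```

Effects like these, tiny per operation, are why long reduction chains over large data sets (the ParProg outcome below) need numerical care such as compensated summation or careful ordering.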
| Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (Core) | Advanced Bloom Level | Learning Outcome (Advanced) | Where Covered |
|---|---|---|---|---|---|
| Rounding issues | N | | C | Arch2: Explain differences in rounding modes. Algo2: Give examples of accumulation of error and loss of precision. ParProg: Explain the need for numerical analysis to ensure that the accumulation of error in long chains of calculation over large data sets is controlled | Arch2; Algo2; ParProg |
| Error propagation | K | Be able to explain NaN and Infinity values, and how they affect computations and exception handling | | | CS2 |
| IEEE 754 standard | K | Representation, range, precision, rounding, NaN, infinities, subnormals, comparison, effects of casting to other types | | | CS1; CS2; Systems |
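The IEEE 754 row covers bit-level representation; Python's struct module can expose it directly. A short sketch of the 1 sign + 11 exponent + 52 fraction layout of a double, and of the special values:

```python
import math
import struct

def double_bits(x):
    """64-bit IEEE 754 pattern of a Python float, as a binary string."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(as_int, "064b")

print(double_bits(1.0))   # sign 0, biased exponent 1023, fraction all zero
print(double_bits(-0.0))  # only the sign bit set: negative zero exists

print(math.inf + 1.0 == math.inf)  # -> True: infinities absorb finite values
print(math.nan == math.nan)        # -> False: NaN is unequal even to itself
```

The NaN comparison rule is the reason `x != x` is a portable NaN test, and why sorting or equality checks on data containing NaNs need special handling.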
| Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (Core) | Advanced Bloom Level | Learning Outcome (Advanced) | Where Covered |
|---|---|---|---|---|---|
| **Performance Metrics** | | | | | |
| Instructions per cycle | C | Explain the pros and cons of IPC as a performance metric across various pipelined implementations | C | Describe the impact of multithreading on IPC calculation | Systems; Arch2 |
| Benchmarks | K | Awareness of various benchmarks (such as LINPACK and the NAS Parallel Benchmarks) and how they test different aspects of performance in parallel systems | | | Systems |
| ● Bandwidth benchmarks | N | | K | Be aware that some benchmarks focus on data movement instead of computation, that different aspects of the architecture contribute to bandwidth, and that bandwidth is defined with respect to the source and destination points across which it is measured | Arch2 |
| Memory bandwidth | N | | C | Be able to explain the significance of memory bandwidth with respect to multicore access and different contending workloads, and the challenge of measuring it | Arch2 |
| Network bandwidth | N | | C | Know how network bandwidth is specified, and explain the limitations of the metric for predicting performance, given different workloads that communicate with different contending patterns | Arch2 |
| Peak performance | C | Explain what peak performance is and why it is rarely valid for estimating real performance; illustrate fallacies | | | Systems |
| ● MIPS/FLOPS | K | Know the meaning of the terms | | | Systems |
| Sustained performance | C | Know the difference between peak and sustained performance, how to define and measure it, and the relevant benchmarks | | | Systems |
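Peak performance in the table is a computed ceiling, not a measurement: multiply core count, clock, and FLOPs issued per core per cycle. The gap to sustained performance is the point of the row. A sketch with hypothetical machine parameters:

```python
# Peak = cores x clock (GHz) x FLOPs issued per core per cycle.
# All machine parameters below are hypothetical.
def peak_gflops(cores, ghz, flops_per_cycle):
    return cores * ghz * flops_per_cycle

peak = peak_gflops(cores=16, ghz=3.0, flops_per_cycle=16)  # 768 GFLOP/s

# A memory-bound kernel might sustain only a small fraction of that ceiling:
sustained = 60.0
print(peak, f"{sustained / peak:.1%}")  # -> 768.0 7.8%
```

The fallacy the core outcome warns about is treating the 768 as achievable: the peak assumes every core issues its widest FLOP every cycle, which real workloads rarely do.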
| Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (Core) | Advanced Bloom Level | Learning Outcome (Advanced) | Where Covered |
|---|---|---|---|---|---|
| **Power** | | | | | |
| Power, Energy | C | Be able to explain the difference between power (the rate of energy usage) and energy | | | Systems |
| Large-scale systems, distributed embedded systems | N | | K | Know about active power states and power reduction of CPUs using dynamic voltage and frequency scaling (DVFS), and idle power management using sleep states. An advanced architecture course could cover link power reduction for multilane serial links, and power management as it relates to large-scale systems and low-power IoT-type devices | Arch2 |
| Power density | N | | K | Be aware that power density creates challenges for the thermodynamics of chip designs, for cooling technologies, and for the supply of power to a compute unit (e.g., a rack), and that these challenges are central to modern architectures | Arch2 |
| Static vs. dynamic power | N | | C | Grasp the concept that some energy usage is relatively constant, while some varies with compute load or data content (1's and 0's). Explain the significance of leakage current as chip feature size shrinks, and how scaling up core count may require a reduced clock rate to compensate | Arch2 |
| Clock scaling / clock gating / power gating / thermal controls | N | | K | Know that clock scaling/gating is a technology for reducing the power used by latches and other storage units that do not need to change. Know that modern systems have thermal sensors that guide clock scaling | Arch2 |
| Power-efficient HW design: simpler cores, short pipelines | N | | C | Explain the power-efficiency advantages of simpler designs. To apply, students could work through an exercise comparing an out-of-order core with an in-order core, using hypothetical performance and energy values | Arch2 |
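Several of the power rows rest on one relationship: dynamic power scales as P = C·V²·f, and lowering frequency usually allows lowering voltage as well, which is why DVFS and "more, slower cores" save energy. A sketch with illustrative values:

```python
# Dynamic power model: P = C * V^2 * f (capacitance, voltage, frequency).
def dynamic_power(c, v, f):
    return c * v ** 2 * f

# One core at full clock vs. two cores at half clock and reduced voltage,
# delivering the same nominal throughput (all values are illustrative):
one_fast = dynamic_power(c=1.0, v=1.0, f=2.0)      # 2.0 units
two_slow = 2 * dynamic_power(c=1.0, v=0.8, f=1.0)  # ~1.28 units

print(one_fast, two_slow)  # parallel-and-slower uses ~36% less power
```

The quadratic voltage term is what makes the tradeoff work; it is also why leakage (static power), which this model omits, becomes the limiting factor at low voltages.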
| Topics | Core Bloom Level | Learning Outcomes and Teaching Suggestions (Core) | Advanced Bloom Level | Learning Outcome (Advanced) | Where Covered |
|---|---|---|---|---|---|
| **Scaling (HPC, Big Data)** | | | | | |
| Reliability and fault tolerance issues (cross-cutting topics of the current curriculum) | N | | K | Large-scale parallel/distributed hardware/software systems are prone to component failures, but the system as a whole needs to keep working | Arch2 |
| Hardware support for data-bound computation | N | | C | May include caching, prefetching, RAID storage, and high-performance networks such as InfiniBand. Know that modern architectures are starting to provide hardware support for operations that are crucial for data science; may include GPU collective operations and memory access coalescing | Arch2 |
| Hardware limitations for data-bound computation | K | Comprehend the hardware limitations in the context of Big Data and its most relevant V's (volume, velocity) | C | Arch2: Considerations of memory system bandwidth, latency hiding, network bandwidth, and direct interprocessor links (e.g., QPI, GPUDirect). ParProg: Be able to use tools to identify data transfer bottlenecks | Systems; Arch2; ParProg |
| Pressure imposed by data volume | K | Know the limitations of storage, memories, and filesystems when dealing with very large data volumes | C | Arch2: Compare and contrast the limitations of different storage technologies. ParProg: Be aware of the architectural aspects of large systems that affect structuring computation to enable load balancing, and optimizations for throughput | Systems; Arch2; ParProg |
| Pressure imposed by data velocity | K | Know the limitations of bandwidth, memory, and processing when handling very fast data streams, including networking and reliability issues | C | Arch2: Explain the need to balance system resources to avoid bottlenecks in processing fast data streams. ParProg: Explain architectural aspects of data-velocity issues: load balancing, jitter, and skew in coordinating flows of data | Systems; Arch2; ParProg |
| Cache growth with scaling out | N | | K | Know the difference between scaling up and scaling out. As the number of nodes scales, total cache space increases, which can allow partitionable problems to scale more effectively (giving the appearance of super-linear speedup), and may impact inter-node communication | Arch2 |
| Cost of data movement across memory hierarchies | C | Comprehend the difference in speed vs. cost (latency, energy) of accessing the different levels of the memory hierarchy, and of changing their sizes, in the context of multiple processors and hierarchies | C | Arch2: Enumerate factors affecting bandwidth, latency, and power. ParProg: Describe the impact on synchronization and load balancing, and tools for measurement; how increased core counts may be seen as a means to increase total cache space in a large system | Systems; Arch2; ParProg |
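The cache-growth row describes one of the few legitimate sources of super-linear speedup: adding nodes adds aggregate cache, so each node's share of a fixed working set eventually fits. A toy model (all parameters hypothetical) makes the effect visible:

```python
# Toy model: per-node time for a partitionable problem where each node
# caches up to cache_gb of its share; misses cost 10x hits.
def run_time(working_set_gb, nodes, cache_gb=1.0, hit=1.0, miss=10.0):
    share = working_set_gb / nodes         # data each node must touch
    hit_rate = min(1.0, cache_gb / share)  # fraction served from cache
    return share * (hit_rate * hit + (1 - hit_rate) * miss)

t1 = run_time(8.0, nodes=1)  # working set overflows one node's cache
t8 = run_time(8.0, nodes=8)  # now each 1 GB share fits entirely

print(t1 / t8)  # -> 71.0: far more than 8x, i.e., super-linear speedup
```

The model deliberately ignores inter-node communication; in practice the communication cost noted in the same row eats into this gain, and the super-linear effect appears only while the cache transition dominates.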