EECS 4340 — Final cheat sheet
# Introduction & Performance
- Latency vs. throughput — Latency = time per fixed task; throughput = tasks per unit time. Throughput exploits parallelism, latency cannot. Match the metric to the question.
- CPI — Average cycles per instruction. The microarchitect’s lever — pipelining, OoO, caches all move CPI.
- IPC — Reciprocal of CPI; higher is better. Used when reasoning about wide-issue/superscalar throughput.
- MIPS — Frequency / CPI / 10^6. Partial metric — a smarter compiler can lower MIPS while making the program faster, so never compare across ISAs/compilers with MIPS.
- Work T1 vs. critical path T_inf — Average parallelism = T1/T_inf. A p-wide machine takes at least max(T1/p, T_inf); critical path is the floor that adding cores cannot lower.
- Iron law — Execution time = IC × CPI × T_cyc. Compiler owns IC, microarchitect owns CPI, circuit designer owns T_cyc. Every architectural change is a tradeoff among these three.
- Amdahl's law — Speedup = 1 / ((1 − f) + f/S), where f = fraction of time accelerated and S = local speedup on that fraction. Ceiling is 1/(1 − f) as S → ∞, so a 90%-parallel program caps at 10× speedup no matter how many cores you throw at it.
- Dynamic power — P = C·V²·A·f, where C = switched capacitance, V = supply voltage (squared!), A = activity factor (target of clock gating), f = clock frequency. Under DVFS V tracks f, so dynamic power scales roughly as V³ (or f³).
- Means — match the mean to the quantity: arithmetic for quantities proportional to time (latencies, cycles); harmonic for rates (throughput, MIPS, bandwidth) — it equals the arithmetic mean of the underlying times; geometric for unit-less ratios (speedups normalized to a baseline). The naive trap: averaging two speeds over equal distance (30 mph and 90 mph) with the arithmetic mean gives 60, but the true average is the harmonic mean, 45.
- Power vs. energy — Power (W) sets instantaneous limits: heatsink size, DVFS P-state, package TDP. Energy (J) = power × time sets cumulative limits: battery life, datacenter electricity bill. A slower lower-watt design can use more total energy if leakage keeps flowing during the longer runtime (motivates "race-to-halt"). Combined metrics: EDP = E·t penalizes slow designs; ED²P = E·t² is approximately voltage-invariant under DVFS, so it ranks designs fairly regardless of where the V/f knob is set.
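The formulas above are easy to sanity-check numerically — a minimal sketch (the function names `amdahl` and `hmean` are ours, and the numbers are illustrative, not from the course):

```python
def amdahl(f, s):
    """Overall speedup when fraction f of execution time is sped up by s."""
    return 1.0 / ((1.0 - f) + f / s)

# 90%-parallel program: ceiling is 1/(1-0.9) = 10x even as s -> infinity.
print(amdahl(0.9, 10))      # 10 cores gets only ~5.26x
print(amdahl(0.9, 1e9))     # ~10x: the 1/(1-f) ceiling

def hmean(rates):
    """Harmonic mean of rates = arithmetic mean of the underlying times."""
    return len(rates) / sum(1.0 / r for r in rates)

print(hmean([30, 90]))      # 45.0 mph, not the naive arithmetic 60

# ED^2P is roughly DVFS-invariant: E ~ V^2, t ~ 1/f ~ 1/V, so E*t^2 ~ const.
for v in (0.8, 1.0, 1.2):
    e, t = v**2, 1.0 / v
    print(round(e * t * t, 3))   # same value at every V/f operating point
```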
# Basic instructions
| Mnemonic | Form | Action | Stages |
|---|---|---|---|
| add | add rd, rs, rt | rd ← rs + rt | IF · ID · EX · WB |
| sub | sub rd, rs, rt | rd ← rs − rt | IF · ID · EX · WB |
| nand | nand rd, rs, rt | rd ← ¬(rs ∧ rt) | IF · ID · EX · WB |
| nor | nor rd, rs, rt | rd ← ¬(rs ∨ rt) | IF · ID · EX · WB |
| xor | xor rd, rs, rt | rd ← rs ⊕ rt | IF · ID · EX · WB |
| slt | slt rd, rs, rt | rd ← (rs < rt) ? 1 : 0 | IF · ID · EX · WB |
| sll | sll rd, rt, sh | rd ← rt ≪ sh | IF · ID · EX · WB |
- One-cycle ALU op in EX, writes rd in WB. No MEM access.
- Adjacent producer→consumer chains are covered by EX→EX forwarding, so a RAW between two R-type instructions costs zero stalls.
| Mnemonic | Form | Action | Stages |
|---|---|---|---|
| addi | addi rt, rs, imm | rt ← rs + sext(imm) | IF · ID · EX · WB |
- Same datapath as R-type `add`; the ALU's second input is the sign-extended immediate instead of rt.
- The most common immediate form on the exams — used for loop counters, address arithmetic, and as the canonical "NOP-replacement" producer in hazard traces.
| Mnemonic | Form | Action | Stages |
|---|---|---|---|
| lw | lw rt, off(rs) | rt ← M[rs + sext(off)] | IF · ID · EX · MEM · WB |
| sw | sw rt, off(rs) | M[rs + sext(off)] ← rt | IF · ID · EX · MEM |
- EX computes the effective address; MEM does the data-cache access.
- `sw` does not write a register, so it skips WB.
- Load-use RAW survives forwarding — the load's data is only available out of MEM, so a dependent instruction in the next slot costs one stall cycle (1 CPI penalty). Schedule one independent instruction between `lw` and its consumer to reclaim it.
| Mnemonic | Form | Action | Stages |
|---|---|---|---|
| beq | beq rs, rt, off | if rs == rt: PC ← PC + 1 + sext(off) | IF · ID · EX |
| j | j target | PC ← target | IF · ID |
| jal | jal target | $ra ← PC + 1; PC ← target | IF · ID · WB |
- Branches resolve in EX (the comparator sits next to the ALU). Under static predict-not-taken, a taken branch squashes the two instructions already in IF and ID — a 2-cycle CPI penalty per misprediction.
- `jal` writes the return PC into `$ra` (R31) so a callee can `jr $ra` to return; treat it as an unconditional jump that also produces a register write in WB.
- Any unresolved branch fetched past is speculative until EX confirms the direction; smarter predictors (covered in the branch-prediction section) reduce the squash rate but don't change the resolution stage.
| Mnemonic | Form | Action | Stages |
|---|---|---|---|
| mul | mul rd, rs, rt | rd ← rs × rt | IF · ID · EX (multi-cycle FU) · WB |
| div | div rd, rs, rt | rd ← rs ÷ rt | IF · ID · EX (multi-cycle FU) · WB |
- These do not fit the single-cycle EX of the standard 5-stage — they execute on a multi-cycle (or pipelined) functional unit alongside the ALU. Latency is several cycles (`mul` typically 3–5, `div` 10–40) and is not always pipelined.
- Variable EX latency is where in-order pipelines start to expose WAW and out-of-order completion problems — the trigger for scoreboarding/Tomasulo in later lectures.
# Pipelining
- RAW — true (flow) dependence — survives renaming; the only register hazard that fires in an in-order pipeline.
- WAR — anti-dependence — name-only; eliminated by register renaming.
- WAW — output dependence — name-only; also eliminated by renaming.
- structural — two instructions need the same hardware port in one cycle (e.g. unified mem).
- control — branch direction/target unknown when next fetch must occur.
- forwarding — EX/Mem and Mem/WB bypass paths kill most RAW stalls; one load-use stall remains.
Pipeline speedup — S = n·k / (n + k − 1), where n = instruction count and k = pipeline depth. Time for n insns is (n + k − 1) cycles vs. n·k unpipelined; S → k as n ≫ k.
CPI with stalls — CPI = 1 + f_load·f_dep. After forwarding, the only data stall is a load immediately followed by a dependent instruction (1-cycle bubble). f_load is the load fraction; f_dep is the fraction of loads whose result is consumed by the immediately following instruction. Add control-hazard penalties if branches aren't predicted.
Pipelining attacks T_{cyc} (shorter stages) while keeping CPI near 1 — but stalls and squashes inflate CPI, and deeper pipes amplify mispredict cost.
A data hazard fires when the consumer’s read stage (X) reaches its register before the producer’s write stage (Y) has finished. In the in-order 5-stage pipe, read = ID (2) and write = WB (5), so dist(X, Y) = 3: any consumer fewer than 3 instructions behind the producer hits a RAW hazard. WAR and WAW never fire in-order — they only become real once OoO or multi-cycle FUs let writes finish out of program order.
With full forwarding, the residual penalties on a 5-stage pipe are: (1) a load-use RAW costs exactly 1 stall cycle because the loaded value isn’t ready until end of MEM, and (2) a taken branch resolved in EX squashes the 2 instructions behind it under static “predict not-taken” — fixed by branch prediction or speculate-and-squash with a real predictor.
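The load-use and branch-squash penalties above compose into a one-line CPI model; a minimal sketch (function names and the example fractions are ours, not from the course):

```python
def pipelined_cpi(f_load, f_dep, f_br, mispredict_rate, penalty):
    """Base CPI 1 plus load-use bubbles plus branch-squash cycles.
    f_dep = fraction of loads consumed by the very next instruction."""
    load_use = f_load * f_dep * 1              # 1-cycle bubble per load-use pair
    control = f_br * mispredict_rate * penalty # squashed insns per mispredict
    return 1.0 + load_use + control

def pipeline_speedup(n, k):
    """n instructions on a k-stage pipe vs. unpipelined: n*k / (n + k - 1)."""
    return (n * k) / (n + k - 1)

# 25% loads, 40% of them load-use; 20% branches, predict-not-taken with
# 60% taken and the 2-cycle squash of EX-resolved branches.
print(pipelined_cpi(0.25, 0.40, 0.20, 0.60, 2))   # 1 + 0.1 + 0.24 = 1.34
print(pipeline_speedup(1000, 5))                  # approaches k = 5 as n grows
```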
# Out-of-Order Execution
- scoreboard — CDC 6600 — no renaming; stalls on WAW at D, waits on WAR at W.
- Tomasulo — RS + CDB + value-copy renaming kills WAR/WAW; only RAW remains.
- reservation station — Per-FU buffer holding op + tags T1/T2 + value copies V1/V2 until ready.
- CDB — Single broadcast wire carrying (tag, value); every RS CAM-matches the tag to snarf operands.
- register renaming — Map architectural names to physical locations — eliminates WAR/WAW by construction.
- ROB — FIFO of in-flight insns in program order; in-order retire gives precise state.
- precise interrupts — Architectural state at fault = state after committing all older, none younger.
- speculation — Execute past unresolved branches/loads; squash via ROB on misprediction.
After renaming, only RAW edges count. Scoreboard counts RAW + WAW + WAR; Tomasulo counts RAW only — that gap is exactly why Tomasulo wins (e.g., 7/4 = 1.75 vs 7/3 ≈ 2.33 on the canonical midterm trace).
| Hazard | In-order | Scoreboard | Tomasulo / P6 / R10K |
|---|---|---|---|
| RAW | stall EX | wait at S | wait at S (CDB snarf) |
| WAW | n/a | stall at D | gone (renaming) |
| WAR | n/a | wait at W | gone (value-copy / PRF) |
| structural | stall | stall at D (no FU/FUST) | stall at D (no RS/ROB/PR) |
RAW is the only data hazard that survives register renaming — it is a real producer→consumer edge, not a name conflict.
Both use Tomasulo-style RS plus a ROB for precise state — they differ in where values live.
- P6 (value-based): tag = ROB#. Operand values are copied ARF/ROB → RS at dispatch, FU → ROB at complete, ROB → ARF at retire. Recovery is free (flush everything); fast clocking is hard (lots of value movement).
- R10K (true renaming): tag = PR#. One unified physical register file; ROB holds only `T` and `Told`. CDB carries tags, never values. Fast clocking is easy; recovery requires serial rollback (or branch checkpoints) using `Told` to restore the map table. Free a physical register only when the next writer of the same architectural register retires.
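A toy sketch of the R10K map-table / free-list / `Told` mechanics, assuming a made-up machine with 4 architectural and 8 physical registers (class and method names are invented):

```python
class Renamer:
    """Toy R10K-style renamer: map table, free list, ROB of (dest, T, Told)."""

    def __init__(self, arch_regs=4, phys_regs=8):
        self.map = {f"r{i}": f"p{i}" for i in range(arch_regs)}
        self.free = [f"p{i}" for i in range(arch_regs, phys_regs)]
        self.rob = []  # program order, oldest first

    def rename(self, dest):
        t = self.free.pop(0)       # allocate a fresh physical register (T)
        told = self.map[dest]      # remember the previous mapping (Told)
        self.map[dest] = t
        self.rob.append((dest, t, told))
        return t

    def retire(self):
        # Free Told only now: the NEXT writer of the same arch reg retires.
        dest, t, told = self.rob.pop(0)
        self.free.append(told)

    def rollback(self):
        # Serial rollback: walk the ROB youngest-first, restoring Told.
        while self.rob:
            dest, t, told = self.rob.pop()
            self.map[dest] = told
            self.free.insert(0, t)

r = Renamer()
r.rename("r1")          # r1 -> p4
r.rename("r1")          # r1 -> p5, Told = p4
r.rollback()            # mispredict: both undone
print(r.map["r1"])      # back to the original mapping, p1
```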
Registers are not the only state — memory must be precise too. The store queue holds completed-but-unretired stores; the D-cache only sees retired stores in program order.
- Conservative (P6): loads wait until every older store address is known, then either forward from SQ (youngest matching older store wins) or read the cache.
- Opportunistic / speculative: loads execute past ambiguous-address stores. A load queue records load addresses so a later store-address resolve can detect a memory ordering violation and trigger a flush.
- Hybrid (Store-Sets, Alpha 21264): predict which loads alias which stores; wait conservatively only for those, speculate on the rest.
Empirically <10% of loads forward, so opportunistic + a small predictor wins on average.
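The "youngest matching older store wins" rule above can be sketched as follows (the queue layout and names are invented; real hardware CAM-searches the SQ in parallel rather than walking it):

```python
def load_value(addr, store_queue, cache):
    """store_queue is oldest-first: (seq, addr, value) of unretired stores.
    Forward from the youngest older store matching addr, else read cache."""
    match = None
    for seq, st_addr, val in store_queue:  # oldest -> youngest
        if st_addr == addr:
            match = val                    # a later match overwrites: youngest wins
    return match if match is not None else cache.get(addr)

cache = {0x100: 7, 0x300: 5}
sq = [(1, 0x100, 11), (2, 0x200, 22), (3, 0x100, 33)]
print(load_value(0x100, sq, cache))   # 33: youngest matching store, not 11 or 7
print(load_value(0x300, sq, cache))   # 5: no SQ match, read the cache
```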
# Branch Prediction
- BTB — PC-indexed cache of (predicted target, is-branch?). Handles direct branches and most indirect calls; partial tags OK because mispredicts are recoverable.
- BHT — PC-indexed table of direction bits (or 2-bit counters). No tags — aliasing is just another mispredict.
- 2-bit counter — Saturating 4-state FSM (SN/WN/WT/ST). Predict taken iff high bit = 1; needs two consecutive wrongs to flip polarity (hysteresis kills the inner-loop pathology of a 1-bit BHT).
- RAS — Tiny circular stack: push the return PC on call, pop on return. The only structure that handles returns well — BTB is hopeless because each return PC sees many targets.
- BHR — k-bit shift register of recent T/N outcomes. Global (one shared) captures cross-branch correlation; per-PC (local) captures self-correlation. Must be checkpointed and rolled back on mispredict.
- gshare — GBHR XOR PC indexes a shared PHT of 2-bit counters. Cheap approximation of per-PC PHTs; basis of the Alpha 21264 hybrid and most modern descendants.
- hybrid (tournament) — Two predictors run in parallel + a meta-predictor (per-PC 2-bit counter) picks the more accurate one for each branch. McFarling/Alpha 21264 reaches 90–95% direction accuracy.
Mispredict penalty — ΔCPI = f_br · (1 − a) · P, where f_br = branch frequency (~0.2 in integer code), a = predictor accuracy, and P = pipeline flush penalty in cycles (≈ pipeline depth). At f_br = 0.2, a = 0.95, P = 10: ΔCPI = 0.2 · 0.05 · 10 = 0.10 — a 10% penalty on top of CPI_ideal. Multiplicative trap: the chance that all three branches in one wide fetch group are predicted correctly is only 0.95³ ≈ 0.857.
gshare index — PHT index = BHR ⊕ PC[k+1:2], where k = log₂(PHT size) = BHR width. XOR (not concat) keeps the index k bits wide while still distinguishing PCs that share recent global history. Drop the 2 low PC bits (instruction-aligned) before XORing.
Local (per-PC) BHR captures self-correlation — perfect for tight inner loops whose outcome depends only on their own recent iterations (e.g. the for(j=0;j<3;j++) example: 3-bit local history reaches 100% steady-state accuracy after warming up the active patterns). Global BHR captures cross-branch correlation — needed when one branch’s outcome implies another’s (if(aa==2) aa=0; if(bb==2) bb=0; if(aa!=bb)). gshare gets most of both at GAg cost. Yeh-Patt’s {G,P}A{g,p,s} taxonomy: first letter = BHR partitioning, last letter = PHT partitioning.
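A minimal gshare sketch following the indexing recipe above (the table size, initial counter state, and names are ours):

```python
K = 10                      # BHR width = log2(PHT entries)
pht = [1] * (1 << K)        # 2-bit counters, init weakly not-taken (01)
bhr = 0                     # global history shift register

def predict(pc):
    idx = ((pc >> 2) ^ bhr) & ((1 << K) - 1)   # drop 2 low PC bits, XOR, mask
    return pht[idx] >= 2                        # high bit set -> predict taken

def update(pc, taken):
    global bhr
    idx = ((pc >> 2) ^ bhr) & ((1 << K) - 1)
    pht[idx] = min(3, pht[idx] + 1) if taken else max(0, pht[idx] - 1)
    bhr = ((bhr << 1) | taken) & ((1 << K) - 1)  # shift outcome into history

# An always-taken loop branch: once the history register has saturated to
# all-ones, the index stabilizes and the counter trains to taken.
for _ in range(16):
    update(0x400, True)
print(predict(0x400))
```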
Big accurate predictors (TAGE, hybrid) take 2–3 cycles, but fetch needs an answer every cycle. Solution: an overriding predictor uses a single-cycle bimodal/BTB for the initial prediction and squashes wrong-path insns when the slow predictor disagrees. Fast recovery itself uses a branch stack (snapshot of map table, free list, ROB/LSQ tails, BHR) plus a per-instruction b-mask — on a mispredict, every RS/FU entry whose mask bit is set for the offending branch is flash-squashed in one cycle, not by walking the ROB. Branches can resolve out-of-order; nested mispredicts work because each branch owns its own mask bit.
# Memory Hierarchy
- cache block — unit of transfer between levels (typ. 64 B). Address splits into tag | index | block-offset; bigger blocks help spatial locality but raise conflict/capacity misses.
- associativity — ways per set. 1-way = direct-mapped, fully-associative = 1 set. Targets conflict misses; returns saturate near 4–8-way.
- LRU — evicts oldest-touched way. Cheap for 2-way; NMRU/random/PLRU for higher associativity.
- write-back vs write-through — WB: dirty bit, only update next level on eviction (low bandwidth). WT: every store propagates (simple coherence).
- victim cache — small fully-associative buffer next to a low-associativity L1. Catches conflict-evicted lines, even 4–8 entries help.
- non-blocking cache + MSHRs — service hits/new misses while earlier misses pend. # MSHRs upper-bounds memory-level parallelism.
- critical-word first — deliver requested word as soon as it arrives, refill rest in background — cuts effective miss penalty.
- inclusion — L2 ⊇ L1 so external snoops only check L2; requires back-invalidation on L2 evictions.
- virtual memory — per-process private address space + demand paging, both built on address translation checked on every reference.
- TLB — cache of recent VPN→PPN translations + permission bits. Hit ≤ 1 pclk; miss triggers page-table walk.
- multi-level page table — VPN split per level so unused subtrees cost nothing. x86-64: 4 or 5 levels of 9-bit fields, 8-byte PTEs, 4 KB pages.
- VIPT vs PIPT — VIPT: index from page-offset bits → TLB and cache fire in parallel; max safe size = page × associativity. PIPT: TLB serial — slower hit but no aliasing.
- page fault — translation missing or page swapped to disk; OS handles in software, costs ~10⁷ cycles.
- prefetching — speculative fetch ahead of demand. Targets compulsory/capacity/coherence misses; never conflict.
AMAT — AMAT = t_hit + m·t_miss, where m = miss rate (in [0,1]) and t_miss is the miss penalty. Multi-level: replace t_miss with the next level's AMAT, e.g. AMAT = t_L1 + m_L1·(t_L2 + m_L2·t_mem). Beware local vs. global miss rate: local = misses_Li / accesses_Li; global = misses_Li / total CPU refs.
Cache geometry — C = S·a·b, so S = C/(a·b); index bits = log₂ S, offset bits = log₂ b. C = capacity (bytes), a = associativity (ways), b = block size (bytes), S = #sets. Direct-mapped: a = 1. Fully-associative: S = 1, no index. Example: 32 KB, 8-way, 64 B → S = 32768/(8·64) = 64 sets → 6 index bits, 6 offset bits.
Page offset = log2(page size); same in VA and PA. Flat single-level PT for 64-bit VA with 4 KB pages and 8 B PTEs is 2^{52}·8 ≈ 32 PB — why we use multi-level or inverted tables. L = #levels of page table; each level is a memory access on a TLB miss (e.g. x86-64: 4 levels ≈ 4 dependent loads).
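Worked numbers for the AMAT, cache-geometry, and page-table sizes above, using the examples already quoted in the text (the function name `amat` is ours):

```python
import math

def amat(t_hit, m, t_miss):
    return t_hit + m * t_miss

# Two-level hierarchy: AMAT = t_L1 + m_L1 * (t_L2 + m_L2 * t_mem).
# Illustrative latencies: 1-cycle L1, 10-cycle L2, 100-cycle memory.
print(amat(1, 0.05, amat(10, 0.20, 100)))        # 1 + 0.05*30 = 2.5 cycles

# 32 KB, 8-way, 64 B blocks -> 64 sets, 6 index bits, 6 offset bits.
C, a, b = 32 * 1024, 8, 64
S = C // (a * b)
print(S, int(math.log2(S)), int(math.log2(b)))   # 64 6 6

# Flat PT for a 64-bit VA, 4 KB pages, 8 B PTEs: 2^52 entries * 8 B.
print(2**52 * 8 // 2**50)                        # 32 (PiB)
```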
Mark Hill’s miss taxonomy — each C is defined by the experiment that eliminates it: compulsory (cold first-touch; survives an infinite cache) — attack with bigger blocks or prefetching; capacity (working set > cache; survives in fully-associative of same size) — attack with bigger or better replacement; conflict (residual after capacity is removed; only in set-associative or direct-mapped) — attack with higher associativity, victim cache, or skewed associativity; coherence (another core invalidated the line) — only shows up in multiprocessors. Standard write-policy pairings: write-back + write-allocate (modern L1/L2 — coalesce repeated stores, hit-rate-friendly); write-through + no-write-allocate (simple coherence, low complexity, but bandwidth scales with store fraction regardless of miss rate).
Hardware prefetcher families, in increasing pattern complexity: next-line — on miss for line X also fetch X+1; trivial, dominates I-cache, weak on D-cache. Stride — per-PC Reference Prediction Table records last address + last stride; on a confirmed stride match prefetch addr + s, addr + 2s, …; captures non-unit/negative strides through arrays. Stream buffer (Jouppi) — FIFO of sequential lines beyond a detected stream sitting next to the cache, so wrong streams cannot pollute L1; lookup checks FIFO heads. Address-correlating / delta-correlation — Markov-style table or GHB-walked linked list keyed by recent address(es) or deltas, predicting the next address(es) from history; covers irregular-but-repetitive sequences. Spatial (SMS) — record per-region offset bitmap keyed by triggering load PC, replay on next region. Linked-data (Roth) — learn load-to-load value dependences (p = p->next) and run an FSM ahead through the structure.
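The per-PC stride scheme can be sketched with a simplified Reference Prediction Table holding (last address, stride, confirmed) per load PC — the field layout and the `degree` knob are our invention:

```python
rpt = {}   # load PC -> [last_addr, stride, confirmed]

def access(pc, addr, degree=2):
    """Train the RPT on one access; return prefetch addresses issued."""
    if pc not in rpt:
        rpt[pc] = [addr, 0, False]            # first touch: just record
        return []
    last, stride, confirmed = rpt[pc]
    new_stride = addr - last
    if new_stride == stride and stride != 0:  # stride confirmed: prefetch ahead
        rpt[pc] = [addr, stride, True]
        return [addr + stride * i for i in range(1, degree + 1)]
    rpt[pc] = [addr, new_stride, False]       # stride changed: re-train
    return []

# A load walking an array with stride 8 trains in two accesses, then
# prefetches addr+8 and addr+16 on every confirmed access.
print(access(0x40, 1000))   # [] (first touch)
print(access(0x40, 1008))   # [] (stride learned, not yet confirmed)
print(access(0x40, 1016))   # [1024, 1032]
```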
# Multiprocessors
- MSI — three states {Modified, Shared, Invalid}; M = exclusive & dirty (memory stale), S = clean read-only (memory current), I = no copy.
- MESI — MSI + Exclusive: clean line held by exactly one cache; lets a read-then-write avoid the BusInv since no other sharer exists.
- snooping — shared bus is the serialization point; every cache snoops every transaction. Simple; bus saturates beyond ~16 cores.
- directory — home-node directory tracks sharers per line; point-to-point invalidates only to caches that actually share. Scales to many-core; longer hit latency.
- write-back — the Modified state’s whole point — repeated stores to a private line are local hits; BusWB only on eviction.
- write-through — every store hits the bus (e.g. trivial valid/invalid protocol). Simple but bandwidth-hostile.
- lock — mutual exclusion; built atop hardware atomics (test-and-set, CAS, LL/SC). Spin under low contention, block under high — hybrid for variable workloads.
- barrier — all N threads must arrive before any proceeds; foundational to BSP / SPMD phases. Cost grows with N; centralized counters cause cache-line ping-pong.
- transactional memory — optimistic concurrency: speculate through the critical section and roll back on conflict. Wins when contention is rare and critical sections are short.
- shared memory — single global address space; communication is implicit via loads/stores; needs hardware coherence + a memory model.
- message passing — private address spaces with explicit send/recv (MPI). No coherence needed; scales to clusters.
Cache coherence is the contract for one address: every cache eventually agrees on its value, and writes to that address are seen in some single global order. The memory consistency model is the contract across addresses: the order in which one thread’s loads/stores to different locations become visible to other threads. Sequential consistency (Lamport) demands one global interleaving of all ops in program order. Real hardware is weaker — x86 TSO lets a store sit in the store buffer past a younger load; ARM/RISC-V use release consistency where ordering is only guaranteed across explicit fences/acquire-release. Coherence is automatic; consistency requires fences and atomics from the programmer.
Snooping hits two scaling walls: bus bandwidth and per-cache snoop bandwidth (most snoops are ‘I don’t have it’). Directory protocols fix both — replace the bus with a point-to-point on-chip network (mesh/ring) and the broadcast with a per-line sharer list at the home node. Cost: an extra network round-trip on every miss (requestor -> home -> sharers -> requestor) and storage for the sharer vector. Used in any modern 16+ core CMP, NUMA SMPs, and SGI-style machines. Snooping still wins for small core counts because of its simplicity and lack of indirection.
MESI is a write-invalidate protocol: a writer sends BusRdX/BusInv to invalidate every other sharer, then writes locally in Modified. Write-update (e.g. Dragon, Firefly) instead broadcasts the new value so other sharers stay valid — better when many readers consume each write, worse for migratory data. Real hardware almost universally chose invalidate. False sharing is the pathology you must recognize: two threads write different fields that happen to live in the same cache block. The block ping-pongs M -> I -> M between caches on every store even though the threads share no data. Fix by padding hot fields to a cache-line boundary.
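The invalidate protocol and the false-sharing ping-pong can be sketched with a toy two-cache MSI model for a single block (event handling is heavily simplified; the names are ours):

```python
state = {"c0": "I", "c1": "I"}   # per-cache MSI state for one block

def write(me, other):
    """BusRdX/BusInv: invalidate any other copy, go Modified locally."""
    if state[other] in ("M", "S"):
        state[other] = "I"       # snooped invalidate (M would write back first)
    state[me] = "M"

def read(me, other):
    if state[me] != "I":
        return                   # hit: no bus traffic
    if state[other] == "M":
        state[other] = "S"       # owner writes back; both end up Shared
    state[me] = "S"

# False sharing: the cores alternate writes to the SAME block (different
# fields) -- the block ping-pongs M -> I -> M on every single store.
for _ in range(3):
    write("c0", "c1")
    write("c1", "c0")
print(state)                     # the last writer holds M, the other is I
```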
# Exam tips
Every exam in the corpus has at least one question that hinges on this: register renaming (and therefore Tomasulo / P6 / R10k) eliminates only WAW and WAR name dependences. RAW is a true data-flow edge — the consumer must wait for the producer’s value, full stop. When you compute ILP under Tomasulo, drop only WAW/WAR edges from the dependence graph; the longest remaining RAW chain sets the bound.
On the T,T,T,T,T,N,N pattern in the Spring 2024 final, the 2-bit saturating counter reaches weakly not taken by the end of the period and mispredicts the first T of the next period — accuracy is 4/7, not 5/7. Whenever you analyze a periodic pattern, walk one full period and check the predictor’s state at the boundary; the misprediction at the wrap-around is the one students forget.
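That walk-through is easy to verify by simulating the counter over many periods (the state encoding 0=SN, 1=WN, 2=WT, 3=ST is ours):

```python
def run(pattern, periods, state=3):
    """2-bit saturating counter over a repeating T/N pattern; start at ST."""
    correct = total = 0
    for _ in range(periods):
        for taken in pattern:
            pred = state >= 2                 # high bit set -> predict taken
            correct += (pred == taken)
            total += 1
            state = min(3, state + 1) if taken else max(0, state - 1)
    return correct, total

c, t = run([True] * 5 + [False] * 2, periods=100)
print(c, t, c / t)   # steady-state accuracy approaches 4/7, not 5/7
```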