EECS 4340 — Final cheat sheet
# Introduction & Performance
- Latency vs. throughput — Latency = time per fixed task; throughput = tasks per unit time. Throughput exploits parallelism, latency cannot. Match the metric to the question.
- CPI — Average cycles per instruction. The microarchitect’s lever — pipelining, OoO, caches all move CPI.
- IPC — Reciprocal of CPI; higher is better. Used when reasoning about wide-issue/superscalar throughput.
- MIPS — Frequency / CPI / 10^6. Partial metric — a smarter compiler can lower MIPS while making the program faster, so never compare across ISAs/compilers with MIPS.
- Work T1 vs. critical path T_inf — Average parallelism = T1/T_inf. A p-wide machine takes at least max(T1/p, T_inf); critical path is the floor that adding cores cannot lower.
- Iron law — Execution time = IC × CPI × T_cyc. Compiler owns IC, microarchitect owns CPI, circuit designer owns T_cyc. Every architectural change is a tradeoff among these three.
- Amdahl's law — Speedup = 1 / ((1 − f) + f/S), where f = fraction of time accelerated and S = local speedup on that fraction. Ceiling is 1/(1 − f) as S → ∞, so a 90%-parallel program caps at 10× speedup no matter how many cores you throw at it.
- Dynamic power — P = C·V²·A·f, where C = switched capacitance, V = supply voltage (squared!), A = activity factor (target of clock gating), f = clock frequency. Under DVFS V tracks f, so dynamic power scales roughly as V³ (or f³).
- Means — match the mean to the quantity: arithmetic for quantities proportional to time (latencies, cycles); harmonic for rates (throughput, MIPS, bandwidth) — it equals the arithmetic mean of the underlying times; geometric for unit-less ratios (speedups normalized to a baseline). The naive trap: averaging two speeds over equal distance (30 mph and 90 mph) with the arithmetic mean gives 60, but the true average is the harmonic mean, 45.
- Power vs. energy — Power (W) sets instantaneous limits: heatsink size, DVFS P-state, package TDP. Energy (J) = power × time sets cumulative limits: battery life, datacenter electricity bill. A slower lower-watt design can use more total energy if leakage keeps flowing during the longer runtime (motivates "race-to-halt"). Combined metrics: EDP = E·t penalizes slow designs; ED²P = E·t² is approximately voltage-invariant under DVFS, so it ranks designs fairly regardless of where the V/f knob is set.
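The formulas above are easy to sanity-check numerically — a minimal sketch (the function names `amdahl` and `hmean` are ours, and the numbers are illustrative, not from the course):

```python
def amdahl(f, s):
    """Overall speedup when fraction f of execution time is sped up by s."""
    return 1.0 / ((1.0 - f) + f / s)

# 90%-parallel program: ceiling is 1/(1-0.9) = 10x even as s -> infinity.
print(amdahl(0.9, 10))      # 10 cores gets only ~5.26x
print(amdahl(0.9, 1e9))     # ~10x: the 1/(1-f) ceiling

def hmean(rates):
    """Harmonic mean of rates = arithmetic mean of the underlying times."""
    return len(rates) / sum(1.0 / r for r in rates)

print(hmean([30, 90]))      # 45.0 mph, not the naive arithmetic 60

# ED^2P is roughly DVFS-invariant: E ~ V^2, t ~ 1/f ~ 1/V, so E*t^2 ~ const.
for v in (0.8, 1.0, 1.2):
    e, t = v**2, 1.0 / v
    print(round(e * t * t, 3))   # same value at every V/f operating point
```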
# Basic instructions
| Mnemonic | Form | Action | Stages |
|---|---|---|---|
| add | add rd, rs, rt | rd ← rs + rt | IF · ID · EX · WB |
| sub | sub rd, rs, rt | rd ← rs − rt | IF · ID · EX · WB |
| nand | nand rd, rs, rt | rd ← ¬(rs ∧ rt) | IF · ID · EX · WB |
| nor | nor rd, rs, rt | rd ← ¬(rs ∨ rt) | IF · ID · EX · WB |
| xor | xor rd, rs, rt | rd ← rs ⊕ rt | IF · ID · EX · WB |
| slt | slt rd, rs, rt | rd ← (rs < rt) ? 1 : 0 | IF · ID · EX · WB |
| sll | sll rd, rt, sh | rd ← rt ≪ sh | IF · ID · EX · WB |
- One-cycle ALU op in EX, writes rd in WB. No MEM access.
- Adjacent producer→consumer chains are covered by EX→EX forwarding, so a RAW between two R-type instructions costs zero stalls.
| Mnemonic | Form | Action | Stages |
|---|---|---|---|
| addi | addi rt, rs, imm | rt ← rs + sext(imm) | IF · ID · EX · WB |
- Same datapath as R-type `add`; the ALU's second input is the sign-extended immediate instead of rt.
- The most common immediate form on the exams — used for loop counters, address arithmetic, and as the canonical "NOP-replacement" producer in hazard traces.
| Mnemonic | Form | Action | Stages |
|---|---|---|---|
| lw | lw rt, off(rs) | rt ← M[rs + sext(off)] | IF · ID · EX · MEM · WB |
| sw | sw rt, off(rs) | M[rs + sext(off)] ← rt | IF · ID · EX · MEM |
- EX computes the effective address; MEM does the data-cache access.
- `sw` does not write a register, so it skips WB.
- Load-use RAW survives forwarding — the load's data is only available out of MEM, so a dependent instruction in the next slot costs one stall cycle (1 CPI penalty). Schedule one independent instruction between `lw` and its consumer to reclaim it.
| Mnemonic | Form | Action | Stages |
|---|---|---|---|
| beq | beq rs, rt, off | if rs == rt: PC ← PC + 1 + sext(off) | IF · ID · EX |
| j | j target | PC ← target | IF · ID |
| jal | jal target | $ra ← PC + 1; PC ← target | IF · ID · WB |
- Branches resolve in EX (the comparator sits next to the ALU). Under static predict-not-taken, a taken branch squashes the two instructions already in IF and ID — a 2-cycle CPI penalty per misprediction.
- `jal` writes the return PC into `$ra` (R31) so a callee can `jr $ra` to return; treat it as an unconditional jump that also produces a register write in WB.
- Any unresolved branch fetched past is speculative until EX confirms the direction; smarter predictors (covered in the branch-prediction section) reduce the squash rate but don't change the resolution stage.
| Mnemonic | Form | Action | Stages |
|---|---|---|---|
| mul | mul rd, rs, rt | rd ← rs × rt | IF · ID · EX (multi-cycle FU) · WB |
| div | div rd, rs, rt | rd ← rs ÷ rt | IF · ID · EX (multi-cycle FU) · WB |
- These do not fit the single-cycle EX of the standard 5-stage — they execute on a multi-cycle (or pipelined) functional unit alongside the ALU. Latency is several cycles (`mul` typically 3–5, `div` 10–40) and is not always pipelined.
- Variable EX latency is where in-order pipelines start to expose WAW and out-of-order completion problems — the trigger for scoreboarding/Tomasulo in later lectures.
# Pipelining
- RAW — true (flow) dependence — survives renaming; the only register hazard that fires in an in-order pipeline.
- WAR — anti-dependence — name-only; eliminated by register renaming.
- WAW — output dependence — name-only; also eliminated by renaming.
- structural — two instructions need the same hardware port in one cycle (e.g. unified mem).
- control — branch direction/target unknown when next fetch must occur.
- forwarding — EX/Mem and Mem/WB bypass paths kill most RAW stalls; one load-use stall remains.
Pipeline speedup — S = n·k / (n + k − 1), where n = instruction count and k = pipeline depth. Time for n insns is (n + k − 1) cycles vs. n·k unpipelined; S → k as n ≫ k.
CPI with stalls — CPI = 1 + f_load·f_dep. After forwarding, the only data stall is a load immediately followed by a dependent instruction (1-cycle bubble). f_load is the load fraction; f_dep is the fraction of loads whose result is consumed by the immediately following instruction. Add control-hazard penalties if branches aren't predicted.
Pipelining attacks T_{cyc} (shorter stages) while keeping CPI near 1 — but stalls and squashes inflate CPI, and deeper pipes amplify mispredict cost.
A data hazard fires when the consumer’s read stage (X) reaches its register before the producer’s write stage (Y) has finished. In the in-order 5-stage pipe, read = ID (2) and write = WB (5), so dist(X, Y) = 3: any consumer fewer than 3 instructions behind the producer hits a RAW hazard. WAR and WAW never fire in-order — they only become real once OoO or multi-cycle FUs let writes finish out of program order.
With full forwarding, the residual penalties on a 5-stage pipe are: (1) a load-use RAW costs exactly 1 stall cycle because the loaded value isn’t ready until end of MEM, and (2) a taken branch resolved in EX squashes the 2 instructions behind it under static “predict not-taken” — fixed by branch prediction or speculate-and-squash with a real predictor.
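The load-use and branch-squash penalties above compose into a one-line CPI model; a minimal sketch (function names and the example fractions are ours, not from the course):

```python
def pipelined_cpi(f_load, f_dep, f_br, mispredict_rate, penalty):
    """Base CPI 1 plus load-use bubbles plus branch-squash cycles.
    f_dep = fraction of loads consumed by the very next instruction."""
    load_use = f_load * f_dep * 1              # 1-cycle bubble per load-use pair
    control = f_br * mispredict_rate * penalty # squashed insns per mispredict
    return 1.0 + load_use + control

def pipeline_speedup(n, k):
    """n instructions on a k-stage pipe vs. unpipelined: n*k / (n + k - 1)."""
    return (n * k) / (n + k - 1)

# 25% loads, 40% of them load-use; 20% branches, predict-not-taken with
# 60% taken and the 2-cycle squash of EX-resolved branches.
print(pipelined_cpi(0.25, 0.40, 0.20, 0.60, 2))   # 1 + 0.1 + 0.24 = 1.34
print(pipeline_speedup(1000, 5))                  # approaches k = 5 as n grows
```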
# Out-of-Order Execution
- scoreboard — CDC 6600 — no renaming; stalls on WAW at D, waits on WAR at W.
- Tomasulo — RS + CDB + value-copy renaming kills WAR/WAW; only RAW remains.
- reservation station — Per-FU buffer holding op + tags T1/T2 + value copies V1/V2 until ready.
- CDB — Single broadcast wire carrying (tag, value); every RS CAM-matches the tag to snarf operands.
- register renaming — Map architectural names to physical locations — eliminates WAR/WAW by construction.
- ROB — FIFO of in-flight insns in program order; in-order retire gives precise state.
- precise interrupts — Architectural state at fault = state after committing all older, none younger.
- speculation — Execute past unresolved branches/loads; squash via ROB on misprediction.
After renaming, only RAW edges count. Scoreboard counts RAW + WAW + WAR; Tomasulo counts RAW only — that gap is exactly why Tomasulo wins (e.g., 7/4 = 1.75 vs 7/3 ≈ 2.33 on the canonical midterm trace).
| Hazard | In-order | Scoreboard | Tomasulo / P6 / R10K |
|---|---|---|---|
| RAW | stall EX | wait at S | wait at S (CDB snarf) |
| WAW | n/a | stall at D | gone (renaming) |
| WAR | n/a | wait at W | gone (value-copy / PRF) |
| structural | stall | stall at D (no FU/FUST) | stall at D (no RS/ROB/PR) |
RAW is the only data hazard that survives register renaming — it is a real producer→consumer edge, not a name conflict.
Both use Tomasulo-style RS plus a ROB for precise state — they differ in where values live.
- P6 (value-based): tag = ROB#. Operand values are copied ARF/ROB → RS at dispatch, FU → ROB at complete, ROB → ARF at retire. Recovery is free (flush everything); fast clocking is hard (lots of value movement).
- R10K (true renaming): tag = PR#. One unified physical register file; ROB holds only `T` and `Told`. CDB carries tags, never values. Fast clocking is easy; recovery requires serial rollback (or branch checkpoints) using `Told` to restore the map table. Free a physical register only when the next writer of the same architectural register retires.
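A toy sketch of the R10K map-table / free-list / `Told` mechanics, assuming a made-up machine with 4 architectural and 8 physical registers (class and method names are invented):

```python
class Renamer:
    """Toy R10K-style renamer: map table, free list, ROB of (dest, T, Told)."""

    def __init__(self, arch_regs=4, phys_regs=8):
        self.map = {f"r{i}": f"p{i}" for i in range(arch_regs)}
        self.free = [f"p{i}" for i in range(arch_regs, phys_regs)]
        self.rob = []  # program order, oldest first

    def rename(self, dest):
        t = self.free.pop(0)       # allocate a fresh physical register (T)
        told = self.map[dest]      # remember the previous mapping (Told)
        self.map[dest] = t
        self.rob.append((dest, t, told))
        return t

    def retire(self):
        # Free Told only now: the NEXT writer of the same arch reg retires.
        dest, t, told = self.rob.pop(0)
        self.free.append(told)

    def rollback(self):
        # Serial rollback: walk the ROB youngest-first, restoring Told.
        while self.rob:
            dest, t, told = self.rob.pop()
            self.map[dest] = told
            self.free.insert(0, t)

r = Renamer()
r.rename("r1")          # r1 -> p4
r.rename("r1")          # r1 -> p5, Told = p4
r.rollback()            # mispredict: both undone
print(r.map["r1"])      # back to the original mapping, p1
```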
Registers are not the only state — memory must be precise too. The store queue holds completed-but-unretired stores; the D-cache only sees retired stores in program order.
- Conservative (P6): loads wait until every older store address is known, then either forward from SQ (youngest matching older store wins) or read the cache.
- Opportunistic / speculative: loads execute past ambiguous-address stores. A load queue records load addresses so a later store-address resolve can detect a memory ordering violation and trigger a flush.
- Hybrid (Store-Sets, Alpha 21264): predict which loads alias which stores; wait conservatively only for those, speculate on the rest.
Empirically <10% of loads forward, so opportunistic + a small predictor wins on average.
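The "youngest matching older store wins" rule above can be sketched as follows (the queue layout and names are invented; real hardware CAM-searches the SQ in parallel rather than walking it):

```python
def load_value(addr, store_queue, cache):
    """store_queue is oldest-first: (seq, addr, value) of unretired stores.
    Forward from the youngest older store matching addr, else read cache."""
    match = None
    for seq, st_addr, val in store_queue:  # oldest -> youngest
        if st_addr == addr:
            match = val                    # a later match overwrites: youngest wins
    return match if match is not None else cache.get(addr)

cache = {0x100: 7, 0x300: 5}
sq = [(1, 0x100, 11), (2, 0x200, 22), (3, 0x100, 33)]
print(load_value(0x100, sq, cache))   # 33: youngest matching store, not 11 or 7
print(load_value(0x300, sq, cache))   # 5: no SQ match, read the cache
```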
# Branch Prediction
- BTB — PC-indexed cache of (predicted target, is-branch?). Handles direct branches and most indirect calls; partial tags OK because mispredicts are recoverable.
- BHT — PC-indexed table of direction bits (or 2-bit counters). No tags — aliasing is just another mispredict.
- 2-bit counter — Saturating 4-state FSM (SN/WN/WT/ST). Predict taken iff high bit = 1; needs two consecutive wrongs to flip polarity (hysteresis kills the inner-loop pathology of a 1-bit BHT).
- RAS — Tiny circular stack: push the return PC on call, pop on return. The only structure that handles returns well — BTB is hopeless because each return PC sees many targets.
- BHR — k-bit shift register of recent T/N outcomes. Global (one shared) captures cross-branch correlation; per-PC (local) captures self-correlation. Must be checkpointed and rolled back on mispredict.
- gshare — GBHR XOR PC indexes a shared PHT of 2-bit counters. Cheap approximation of per-PC PHTs; basis of the Alpha 21264 hybrid and most modern descendants.
- hybrid (tournament) — Two predictors run in parallel + a meta-predictor (per-PC 2-bit counter) picks the more accurate one for each branch. McFarling/Alpha 21264 reaches 90–95% direction accuracy.
Mispredict penalty — ΔCPI = f_br · (1 − a) · P, where f_br = branch frequency (~0.2 in integer code), a = predictor accuracy, and P = pipeline flush penalty in cycles (≈ pipeline depth). At f_br = 0.2, a = 0.95, P = 10: ΔCPI = 0.2 · 0.05 · 10 = 0.10 — a 10% penalty on top of CPI_ideal. Multiplicative trap: the chance that all three branches in one wide fetch group are predicted correctly is only 0.95³ ≈ 0.857.
gshare index — PHT index = BHR ⊕ PC[k+1:2], where k = log₂(PHT size) = BHR width. XOR (not concat) keeps the index k bits wide while still distinguishing PCs that share recent global history. Drop the 2 low PC bits (instruction-aligned) before XORing.
Local (per-PC) BHR captures self-correlation — perfect for tight inner loops whose outcome depends only on their own recent iterations (e.g. the for(j=0;j<3;j++) example: 3-bit local history reaches 100% steady-state accuracy after warming up the active patterns). Global BHR captures cross-branch correlation — needed when one branch’s outcome implies another’s (if(aa==2) aa=0; if(bb==2) bb=0; if(aa!=bb)). gshare gets most of both at GAg cost. Yeh-Patt’s {G,P}A{g,p,s} taxonomy: first letter = BHR partitioning, last letter = PHT partitioning.
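A minimal gshare sketch following the indexing recipe above (the table size, initial counter state, and names are ours):

```python
K = 10                      # BHR width = log2(PHT entries)
pht = [1] * (1 << K)        # 2-bit counters, init weakly not-taken (01)
bhr = 0                     # global history shift register

def predict(pc):
    idx = ((pc >> 2) ^ bhr) & ((1 << K) - 1)   # drop 2 low PC bits, XOR, mask
    return pht[idx] >= 2                        # high bit set -> predict taken

def update(pc, taken):
    global bhr
    idx = ((pc >> 2) ^ bhr) & ((1 << K) - 1)
    pht[idx] = min(3, pht[idx] + 1) if taken else max(0, pht[idx] - 1)
    bhr = ((bhr << 1) | taken) & ((1 << K) - 1)  # shift outcome into history

# An always-taken loop branch: once the history register has saturated to
# all-ones, the index stabilizes and the counter trains to taken.
for _ in range(16):
    update(0x400, True)
print(predict(0x400))
```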
Big accurate predictors (TAGE, hybrid) take 2–3 cycles, but fetch needs an answer every cycle. Solution: an overriding predictor uses a single-cycle bimodal/BTB for the initial prediction and squashes wrong-path insns when the slow predictor disagrees. Fast recovery itself uses a branch stack (snapshot of map table, free list, ROB/LSQ tails, BHR) plus a per-instruction b-mask — on a mispredict, every RS/FU entry whose mask bit is set for the offending branch is flash-squashed in one cycle, not by walking the ROB. Branches can resolve out-of-order; nested mispredicts work because each branch owns its own mask bit.
# Memory Hierarchy
- cache block — unit of transfer between levels (typ. 64 B). Address splits into tag | index | block-offset; bigger blocks help spatial locality but raise conflict/capacity misses.
- associativity — ways per set. 1-way = direct-mapped, fully-associative = 1 set. Targets conflict misses; returns saturate near 4–8-way.
- LRU — evicts oldest-touched way. Cheap for 2-way; NMRU/random/PLRU for higher associativity.
- write-back vs write-through — WB: dirty bit, only update next level on eviction (low bandwidth). WT: every store propagates (simple coherence).
- victim cache — small fully-associative buffer next to a low-associativity L1. Catches conflict-evicted lines, even 4–8 entries help.
- non-blocking cache + MSHRs — service hits/new misses while earlier misses pend. # MSHRs upper-bounds memory-level parallelism.
- critical-word first — deliver requested word as soon as it arrives, refill rest in background — cuts effective miss penalty.
- inclusion — L2 ⊇ L1 so external snoops only check L2; requires back-invalidation on L2 evictions.
- virtual memory — per-process private address space + demand paging, both built on address translation checked on every reference.
- TLB — cache of recent VPN→PPN translations + permission bits. Hit ≤ 1 pclk; miss triggers page-table walk.
- multi-level page table — VPN split per level so unused subtrees cost nothing. x86-64: 4 or 5 levels of 9-bit fields, 8-byte PTEs, 4 KB pages.
- VIPT vs PIPT — VIPT: index from page-offset bits → TLB and cache fire in parallel; max safe size = page × associativity. PIPT: TLB serial — slower hit but no aliasing.
- page fault — translation missing or page swapped to disk; OS handles in software, costs ~10⁷ cycles.
- prefetching — speculative fetch ahead of demand. Targets compulsory/capacity/coherence misses; never conflict.
AMAT — AMAT = t_hit + m·t_miss, where m = miss rate (in [0,1]) and t_miss is the miss penalty. Multi-level: replace t_miss with the next level's AMAT, e.g. AMAT = t_L1 + m_L1·(t_L2 + m_L2·t_mem). Beware local vs. global miss rate: local = misses_Li / accesses_Li; global = misses_Li / total CPU refs.
Cache geometry — C = S·a·b, so S = C/(a·b); index bits = log₂ S, offset bits = log₂ b. C = capacity (bytes), a = associativity (ways), b = block size (bytes), S = #sets. Direct-mapped: a = 1. Fully-associative: S = 1, no index. Example: 32 KB, 8-way, 64 B → S = 32768/(8·64) = 64 sets → 6 index bits, 6 offset bits.
Page offset = log2(page size); same in VA and PA. Flat single-level PT for 64-bit VA with 4 KB pages and 8 B PTEs is 2^{52}·8 ≈ 32 PB — why we use multi-level or inverted tables. L = #levels of page table; each level is a memory access on a TLB miss (e.g. x86-64: 4 levels ≈ 4 dependent loads).
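Worked numbers for the AMAT, cache-geometry, and page-table sizes above, using the examples already quoted in the text (the function name `amat` is ours):

```python
import math

def amat(t_hit, m, t_miss):
    return t_hit + m * t_miss

# Two-level hierarchy: AMAT = t_L1 + m_L1 * (t_L2 + m_L2 * t_mem).
# Illustrative latencies: 1-cycle L1, 10-cycle L2, 100-cycle memory.
print(amat(1, 0.05, amat(10, 0.20, 100)))        # 1 + 0.05*30 = 2.5 cycles

# 32 KB, 8-way, 64 B blocks -> 64 sets, 6 index bits, 6 offset bits.
C, a, b = 32 * 1024, 8, 64
S = C // (a * b)
print(S, int(math.log2(S)), int(math.log2(b)))   # 64 6 6

# Flat PT for a 64-bit VA, 4 KB pages, 8 B PTEs: 2^52 entries * 8 B.
print(2**52 * 8 // 2**50)                        # 32 (PiB)
```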
Mark Hill’s miss taxonomy — each C is defined by the experiment that eliminates it: compulsory (cold first-touch; survives an infinite cache) — attack with bigger blocks or prefetching; capacity (working set > cache; survives in fully-associative of same size) — attack with bigger or better replacement; conflict (residual after capacity is removed; only in set-associative or direct-mapped) — attack with higher associativity, victim cache, or skewed associativity; coherence (another core invalidated the line) — only shows up in multiprocessors. Standard write-policy pairings: write-back + write-allocate (modern L1/L2 — coalesce repeated stores, hit-rate-friendly); write-through + no-write-allocate (simple coherence, low complexity, but bandwidth scales with store fraction regardless of miss rate).
Hardware prefetcher families, in increasing pattern complexity: next-line — on miss for line X also fetch X+1; trivial, dominates I-cache, weak on D-cache. Stride — per-PC Reference Prediction Table records last address + last stride; on a confirmed stride match prefetch addr + s, addr + 2s, …; captures non-unit/negative strides through arrays. Stream buffer (Jouppi) — FIFO of sequential lines beyond a detected stream sitting next to the cache, so wrong streams cannot pollute L1; lookup checks FIFO heads. Address-correlating / delta-correlation — Markov-style table or GHB-walked linked list keyed by recent address(es) or deltas, predicting the next address(es) from history; covers irregular-but-repetitive sequences. Spatial (SMS) — record per-region offset bitmap keyed by triggering load PC, replay on next region. Linked-data (Roth) — learn load-to-load value dependences (p = p->next) and run an FSM ahead through the structure.
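The per-PC stride scheme can be sketched with a simplified Reference Prediction Table holding (last address, stride, confirmed) per load PC — the field layout and the `degree` knob are our invention:

```python
rpt = {}   # load PC -> [last_addr, stride, confirmed]

def access(pc, addr, degree=2):
    """Train the RPT on one access; return prefetch addresses issued."""
    if pc not in rpt:
        rpt[pc] = [addr, 0, False]            # first touch: just record
        return []
    last, stride, confirmed = rpt[pc]
    new_stride = addr - last
    if new_stride == stride and stride != 0:  # stride confirmed: prefetch ahead
        rpt[pc] = [addr, stride, True]
        return [addr + stride * i for i in range(1, degree + 1)]
    rpt[pc] = [addr, new_stride, False]       # stride changed: re-train
    return []

# A load walking an array with stride 8 trains in two accesses, then
# prefetches addr+8 and addr+16 on every confirmed access.
print(access(0x40, 1000))   # [] (first touch)
print(access(0x40, 1008))   # [] (stride learned, not yet confirmed)
print(access(0x40, 1016))   # [1024, 1032]
```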
# Multiprocessors
- MSI — three states {Modified, Shared, Invalid}; M = exclusive & dirty (memory stale), S = clean read-only (memory current), I = no copy.
- MESI — MSI + Exclusive: clean line held by exactly one cache; lets a read-then-write avoid the BusInv since no other sharer exists.
- snooping — shared bus is the serialization point; every cache snoops every transaction. Simple; bus saturates beyond ~16 cores.
- directory — home-node directory tracks sharers per line; point-to-point invalidates only to caches that actually share. Scales to many-core; longer hit latency.
- write-back — the Modified state’s whole point — repeated stores to a private line are local hits; BusWB only on eviction.
- write-through — every store hits the bus (e.g. trivial valid/invalid protocol). Simple but bandwidth-hostile.
- lock — mutual exclusion; built atop hardware atomics (test-and-set, CAS, LL/SC). Spin under low contention, block under high — hybrid for variable workloads.
- barrier — all N threads must arrive before any proceeds; foundational to BSP / SPMD phases. Cost grows with N; centralized counters cause cache-line ping-pong.
- transactional memory — optimistic concurrency: speculate through the critical section and roll back on conflict. Wins when contention is rare and critical sections are short.
- shared memory — single global address space; communication is implicit via loads/stores; needs hardware coherence + a memory model.
- message passing — private address spaces with explicit send/recv (MPI). No coherence needed; scales to clusters.
Cache coherence is the contract for one address: every cache eventually agrees on its value, and writes to that address are seen in some single global order. The memory consistency model is the contract across addresses: the order in which one thread’s loads/stores to different locations become visible to other threads. Sequential consistency (Lamport) demands one global interleaving of all ops in program order. Real hardware is weaker — x86 TSO lets a store sit in the store buffer past a younger load; ARM/RISC-V use release consistency where ordering is only guaranteed across explicit fences/acquire-release. Coherence is automatic; consistency requires fences and atomics from the programmer.
Snooping hits two scaling walls: bus bandwidth and per-cache snoop bandwidth (most snoops are ‘I don’t have it’). Directory protocols fix both — replace the bus with a point-to-point on-chip network (mesh/ring) and the broadcast with a per-line sharer list at the home node. Cost: an extra network round-trip on every miss (requestor -> home -> sharers -> requestor) and storage for the sharer vector. Used in any modern 16+ core CMP, NUMA SMPs, and SGI-style machines. Snooping still wins for small core counts because of its simplicity and lack of indirection.
MESI is a write-invalidate protocol: a writer sends BusRdX/BusInv to invalidate every other sharer, then writes locally in Modified. Write-update (e.g. Dragon, Firefly) instead broadcasts the new value so other sharers stay valid — better when many readers consume each write, worse for migratory data. Real hardware almost universally chose invalidate. False sharing is the pathology you must recognize: two threads write different fields that happen to live in the same cache block. The block ping-pongs M -> I -> M between caches on every store even though the threads share no data. Fix by padding hot fields to a cache-line boundary.
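The invalidate protocol and the false-sharing ping-pong can be sketched with a toy two-cache MSI model for a single block (event handling is heavily simplified; the names are ours):

```python
state = {"c0": "I", "c1": "I"}   # per-cache MSI state for one block

def write(me, other):
    """BusRdX/BusInv: invalidate any other copy, go Modified locally."""
    if state[other] in ("M", "S"):
        state[other] = "I"       # snooped invalidate (M would write back first)
    state[me] = "M"

def read(me, other):
    if state[me] != "I":
        return                   # hit: no bus traffic
    if state[other] == "M":
        state[other] = "S"       # owner writes back; both end up Shared
    state[me] = "S"

# False sharing: the cores alternate writes to the SAME block (different
# fields) -- the block ping-pongs M -> I -> M on every single store.
for _ in range(3):
    write("c0", "c1")
    write("c1", "c0")
print(state)                     # the last writer holds M, the other is I
```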
# Exam tips
Every exam in the corpus has at least one question that hinges on this: register renaming (and therefore Tomasulo / P6 / R10k) eliminates only WAW and WAR name dependences. RAW is a true data-flow edge — the consumer must wait for the producer’s value, full stop. When you compute ILP under Tomasulo, drop only WAW/WAR edges from the dependence graph; the longest remaining RAW chain sets the bound.
On the T,T,T,T,T,N,N pattern in the Spring 2024 final, the 2-bit saturating counter reaches weakly not taken by the end of the period and mispredicts the first T of the next period — accuracy is 4/7, not 5/7. Whenever you analyze a periodic pattern, walk one full period and check the predictor’s state at the boundary; the misprediction at the wrap-around is the one students forget.
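That walk-through is easy to verify by simulating the counter over many periods (the state encoding 0=SN, 1=WN, 2=WT, 3=ST is ours):

```python
def run(pattern, periods, state=3):
    """2-bit saturating counter over a repeating T/N pattern; start at ST."""
    correct = total = 0
    for _ in range(periods):
        for taken in pattern:
            pred = state >= 2                 # high bit set -> predict taken
            correct += (pred == taken)
            total += 1
            state = min(3, state + 1) if taken else max(0, state - 1)
    return correct, total

c, t = run([True] * 5 + [False] * 2, periods=100)
print(c, t, c / t)   # steady-state accuracy approaches 4/7, not 5/7
```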