EECS 4340 Final Review

EECS 4340 — Final cheat sheet

# Introduction & Performance

Performance vocabulary
  • Latency vs. throughput — Latency = time per fixed task; throughput = tasks per unit time. Throughput can be improved by exploiting parallelism; the latency of a single task cannot. Match the metric to the question.
  • CPI — Average cycles per instruction. The microarchitect’s lever — pipelining, OoO, caches all move CPI.
  • IPC — Reciprocal of CPI; higher is better. Used when reasoning about wide-issue/superscalar throughput.
  • MIPS — Frequency / CPI / 10^6. Partial metric — a smarter compiler can lower MIPS while making the program faster, so never compare across ISAs/compilers with MIPS.
  • Work T1 vs. critical path T_inf — Average parallelism = T1/T_inf. A p-wide machine takes at least max(T1/p, T_inf); critical path is the floor that adding cores cannot lower.
Iron law
\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}

IC * CPI * T_cyc. Compiler owns IC, microarchitect owns CPI, circuit designer owns T_cyc. Every architectural change is a tradeoff among these three.
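
A quick sanity check in Python of how the three terms trade off; the instruction counts, CPIs, and cycle times below are made-up illustrations, not measurements from the course.

```python
# Iron law: time = IC * CPI * T_cyc. All numbers are illustrative.
def exec_time(ic, cpi, t_cyc_ns):
    return ic * cpi * t_cyc_ns * 1e-9  # seconds

base             = exec_time(ic=1e9,   cpi=1.5, t_cyc_ns=0.5)  # baseline
smarter_compiler = exec_time(ic=0.8e9, cpi=1.6, t_cyc_ns=0.5)  # fewer but "heavier" insns
deeper_pipe      = exec_time(ic=1e9,   cpi=1.8, t_cyc_ns=0.4)  # CPI up, T_cyc down

print(base, smarter_compiler, deeper_pipe)
```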

Amdahl's Law
S_{\text{overall}} = \frac{1}{(1-f) + f/S}

f = fraction of time accelerated, S = local speedup on that fraction. Ceiling is 1/(1-f) as S -> infinity, so a 90%-parallel program caps at 10x speedup no matter how many cores you throw at it.
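
A minimal Python check of the ceiling, using an assumed 90%-parallel program (the 16-core figure is just an example):

```python
# Amdahl's law: overall speedup when a fraction f of runtime is sped up by S.
def amdahl(f, s):
    return 1.0 / ((1 - f) + f / s)

print(amdahl(0.9, 16))                 # ~6.4x on 16 cores
print(amdahl(0.9, float("inf")))       # ceiling 1/(1-f) = 10x, no matter the core count
```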

Dynamic CMOS power
P_{\text{dyn}} \sim \tfrac{1}{2}\, C\, V^2\, A\, f

C = switched capacitance, V = supply voltage (squared!), A = activity factor (target of clock gating), f = clock frequency. Under DVFS V tracks f, so dynamic power scales roughly as V^3 (or f^3).

Pick the right mean

Three means, three use cases: arithmetic for quantities proportional to time (latencies, cycles); harmonic for rates (throughput, MIPS, bandwidth) — equivalent to taking the arithmetic mean of the underlying times; geometric for unit-less ratios (speedups normalized to a baseline). The naive trap: averaging two speeds over equal distances (e.g. 30 mph and 90 mph) with the arithmetic mean gives 60, but the true average speed is the harmonic mean, 45.
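
The 30/90 mph trap, reproduced with the standard-library mean functions (the speedup values are arbitrary examples):

```python
from statistics import mean, harmonic_mean, geometric_mean

speeds = [30, 90]                 # mph over two equal-distance legs
print(mean(speeds))               # 60 -- the naive (wrong) answer
print(harmonic_mean(speeds))      # 45 -- correct average speed
print(2 / (1/30 + 1/90))          # 45 again, via total distance / total time

speedups = [2.0, 8.0]             # unit-less ratios vs. a baseline
print(geometric_mean(speedups))   # 4.0 -- the right mean for normalized ratios
```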

Power vs. energy

Power (W) sets instantaneous limits — heatsink size, DVFS P-state, package TDP. Energy (J) = power x time, sets cumulative limits — battery life, datacenter electricity bill. A slower lower-watt design can use more total energy if leakage keeps flowing during the longer runtime (motivates ‘race-to-halt’). Combined metrics: EDP = Et penalizes slow designs; ED^2P = Et^2 is approximately voltage-invariant under DVFS, so it ranks designs fairly regardless of where the V/f knob is set.

# Basic instructions

R-type ALU
| Mnemonic | Form | Action | Stages |
| --- | --- | --- | --- |
| add | add rd, rs, rt | rd ← rs + rt | IF · ID · EX · WB |
| sub | sub rd, rs, rt | rd ← rs − rt | IF · ID · EX · WB |
| nand | nand rd, rs, rt | rd ← ¬(rs ∧ rt) | IF · ID · EX · WB |
| nor | nor rd, rs, rt | rd ← ¬(rs ∨ rt) | IF · ID · EX · WB |
| xor | xor rd, rs, rt | rd ← rs ⊕ rt | IF · ID · EX · WB |
| slt | slt rd, rs, rt | rd ← (rs < rt) ? 1 : 0 | IF · ID · EX · WB |
| sll | sll rd, rt, sh | rd ← rt ≪ sh | IF · ID · EX · WB |
  • One-cycle ALU op in EX, writes rd in WB. No MEM access.
  • Adjacent producer→consumer chains are covered by EX→EX forwarding, so a RAW between two R-type instructions costs zero stalls.
Immediate ALU
| Mnemonic | Form | Action | Stages |
| --- | --- | --- | --- |
| addi | addi rt, rs, imm | rt ← rs + sext(imm) | IF · ID · EX · WB |
  • Same datapath as R-type add; the ALU’s second input is the sign-extended offset instead of rt.
  • The most common immediate form on the exams — used for loop counters, address arithmetic, and as the canonical “NOP-replacement” producer in hazard traces.
Memory
| Mnemonic | Form | Action | Stages |
| --- | --- | --- | --- |
| lw | lw rt, off(rs) | rt ← M[rs + sext(off)] | IF · ID · EX · MEM · WB |
| sw | sw rt, off(rs) | M[rs + sext(off)] ← rt | IF · ID · EX · MEM |
  • EX computes the effective address; MEM does the data-cache access. sw does not write a register, so it skips WB.
  • Load-use RAW survives forwarding — the load’s data is only available out of MEM, so a dependent instruction in the next slot costs one stall cycle (1 CPI penalty). Schedule one independent instruction between lw and its consumer to reclaim it.
Control / branch+jump
| Mnemonic | Form | Action | Stages |
| --- | --- | --- | --- |
| beq | beq rs, rt, off | if rs == rt: PC ← PC + 1 + sext(off) | IF · ID · EX |
| j | j target | PC ← target | IF · ID |
| jal | jal target | $ra ← PC + 1; PC ← target | IF · ID · WB |
  • Branches resolve in EX (the comparator sits next to the ALU). Under static predict-not-taken, a taken branch squashes the two instructions already in IF and ID — a 2-cycle CPI penalty per misprediction.
  • jal writes the return PC into $ra (R31) so a callee can jr $ra to return; treat it as an unconditional jump that also produces a register write in WB.
  • Any unresolved branch fetched past is speculative until EX confirms the direction; smarter predictors (covered in the branch-prediction section) reduce the squash rate but don’t change the resolution stage.
Multi-cycle
| Mnemonic | Form | Action | Stages |
| --- | --- | --- | --- |
| mul | mul rd, rs, rt | rd ← rs × rt | IF · ID · EX (multi-cycle FU) · WB |
| div | div rd, rs, rt | rd ← rs ÷ rt | IF · ID · EX (multi-cycle FU) · WB |
  • These do not fit the single-cycle EX of the standard 5-stage — they execute on a multi-cycle (or pipelined) functional unit alongside the ALU. Latency is several cycles (mul typically 3–5, div 10–40), and the unit is not always pipelined.
  • Variable EX latency is where in-order pipelines start to expose WAW and out-of-order completion problems — the trigger for scoreboarding/Tomasulo in later lectures.

# Pipelining

Hazards & resolution
  • RAW — true (flow) dependence — survives renaming; the only register hazard that fires in an in-order pipeline.
  • WAR — anti-dependence — name-only; eliminated by register renaming.
  • WAW — output dependence — name-only; also eliminated by renaming.
  • structural — two instructions need the same hardware port in one cycle (e.g. unified mem).
  • control — branch direction/target unknown when next fetch must occur.
  • forwarding — EX/Mem and Mem/WB bypass paths kill most RAW stalls; one load-use stall remains.
k-stage pipeline speedup
S = \frac{n \cdot k}{n + k - 1}

n = instruction count, k = pipeline depth. Time for n insns is (n + k − 1) cycles vs. n·k unpipelined; S → k as n ≫ k.

Load-use CPI (forwarded in-order pipe)
\text{CPI} = 1 + f_{\text{load}} \cdot f_{\text{dep}} \cdot 1

After forwarding, the only data stall is a load immediately followed by a dependent instruction (1-cycle bubble). f_load is the load fraction; f_dep is the fraction of insns dependent on the immediately preceding one. Add control-hazard penalties if branches aren’t predicted.
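
A small sketch combining the load-use bubble with the taken-branch squash from the next section; the fractions are assumed values for illustration, not from the lecture notes.

```python
# CPI of a forwarded 5-stage in-order pipe: base 1 + load-use bubbles + branch squashes.
# All fractions below are illustrative assumptions.
def pipeline_cpi(f_load, f_dep, f_br, squash_rate, branch_penalty):
    load_use = f_load * f_dep * 1                # 1-cycle bubble per load-use pair
    control  = f_br * squash_rate * branch_penalty
    return 1.0 + load_use + control

# Predict-not-taken: every taken branch (assume ~60% taken) squashes 2 slots.
print(pipeline_cpi(f_load=0.25, f_dep=0.4, f_br=0.2, squash_rate=0.6, branch_penalty=2))
# -> 1 + 0.10 + 0.24 = 1.34
```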

Iron law
\text{Time} = \text{IC} \times \text{CPI} \times T_{\text{cyc}}

Pipelining attacks T_{cyc} (shorter stages) while keeping CPI near 1 — but stalls and squashes inflate CPI, and deeper pipes amplify mispredict cost.

Classic 5-stage pipeline (IF | ID | EX | MEM | WB) with the two forwarding paths (EX/MEM → ALU and MEM/WB → ALU) and the WB → ID register-file write that closes the loop.
Hazard distance rule

A data hazard fires when the consumer’s read stage (X) reaches its register before the producer’s write stage (Y) has finished. In the in-order 5-stage pipe, read = ID (2) and write = WB (5), so dist(X, Y) = 3: any consumer fewer than 3 instructions behind the producer hits a RAW hazard. WAR and WAW never fire in-order — they only become real once OoO or multi-cycle FUs let writes finish out of program order.

Load-use & control penalties

With full forwarding, the residual penalties on a 5-stage pipe are: (1) a load-use RAW costs exactly 1 stall cycle because the loaded value isn’t ready until end of MEM, and (2) a taken branch resolved in EX squashes the 2 instructions behind it under static “predict not-taken” — fixed by branch prediction or speculate-and-squash with a real predictor.

# Out-of-Order Execution

Core machinery
  • scoreboard — CDC 6600 — no renaming; stalls on WAW at D, waits on WAR at W.
  • Tomasulo — RS + CDB + value-copy renaming kills WAR/WAW; only RAW remains.
  • reservation station — Per-FU buffer holding op + tags T1/T2 + value copies V1/V2 until ready.
  • CDB — Single broadcast wire (tag, value); every RS CAM-matches to snarf operands.
  • register renaming — Map architectural names to physical locations — eliminates WAR/WAW by construction.
  • ROB — FIFO of in-flight insns in program order; in-order retire gives precise state.
  • precise interrupts — Architectural state at fault = state after committing all older, none younger.
  • speculation — Execute past unresolved branches/loads; squash via ROB on misprediction.
Ideal ILP with infinite resources
\text{ILP} = \frac{N_{\text{insns}}}{\text{critical dependence-chain depth}}

After renaming, only RAW edges count. Scoreboard counts RAW + WAW + WAR; Tomasulo counts RAW only — that gap is exactly why Tomasulo wins (e.g., 7/4 = 1.75 vs 7/3 = 2.33 on the canonical midterm trace).
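
A minimal sketch of that computation: measure the longest dependence chain over whichever edges the scheme counts, then divide. The 7-instruction edge list is a made-up example, not the actual midterm trace.

```python
# ILP upper bound = N_insns / longest dependence chain (in instructions).
# Edges are (producer, consumer) in program order. Under Tomasulo keep only RAW
# edges; under a scoreboard you would add the WAW/WAR edges before measuring depth.
def ilp_bound(n_insns, edges):
    depth = {i: 1 for i in range(n_insns)}                # chain depth ending at each insn
    for prod, cons in sorted(edges, key=lambda e: e[1]):  # program order is topological
        depth[cons] = max(depth[cons], depth[prod] + 1)
    return n_insns / max(depth.values())

raw_edges = [(0, 2), (2, 5), (1, 3)]        # hypothetical 7-instruction trace
print(ilp_bound(7, raw_edges))              # longest RAW chain = 3 -> 7/3 ~= 2.33
```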

Tomasulo + ROB pipeline: in-order Dispatch/Rename (allocate RS, ROB, and LSQ entries; record tags + Told) and in-order Retire bookend out-of-order Issue / Execute / Complete. The CDB broadcast delivers each FU result, tagged with its producer ID, to every RS, ROB slot, and the regfile; the LSQ (SQ + LQ) sits between the FUs and the D-cache, and Retire updates the architectural regfile / D-cache in program order.
Hazards under each scheme
| Hazard | In-order | Scoreboard | Tomasulo / P6 / R10K |
| --- | --- | --- | --- |
| RAW | stall EX | wait at S | wait at S (CDB snarf) |
| WAW | n/a | stall at D | gone (renaming) |
| WAR | n/a | wait at W | gone (value-copy / PRF) |
| structural | stall | stall at D (no FU/FUST) | stall at D (no RS/ROB/PR) |

RAW is the only data hazard that survives register renaming — it is a real producer→consumer edge, not a name conflict.

P6 vs. R10K renaming

Both use Tomasulo-style RS plus a ROB for precise state — they differ in where values live.

  • P6 (value-based): tag = ROB#. Operand values are copied ARF/ROB → RS at dispatch, FU → ROB at complete, ROB → ARF at retire. Recovery is free (flush everything); fast clocking is hard (lots of value movement).
  • R10K (true renaming): tag = PR#. One unified physical register file; ROB holds only T and Told. CDB carries tags, never values. Fast clocking is easy; recovery requires serial rollback (or branch checkpoints) using Told to restore the map table. Free a physical register only when the next writer of the same architectural register retires.
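
A minimal sketch of R10K-style renaming under the rules above — map table, free list, Told recorded for rollback, and Told freed at retire. Class names, table sizes, and the rollback interface are assumptions for illustration only.

```python
# R10K-style renaming sketch: architectural reg -> physical reg via a map table.
# Each dispatched writer grabs a free PR and remembers Told (the PR it displaced),
# so a mispredict can roll the map back and retire can free Told.
class RenameMap:
    def __init__(self, n_arch, n_phys):
        self.map = list(range(n_arch))            # identity mapping at reset
        self.free = list(range(n_arch, n_phys))   # remaining PRs start free

    def rename(self, dst_arch):
        new_pr = self.free.pop(0)
        t_old = self.map[dst_arch]
        self.map[dst_arch] = new_pr
        return new_pr, t_old                      # ROB records (T, Told)

    def rollback(self, dst_arch, new_pr, t_old):
        # undo one rename; walked youngest-first on a mispredict (or restored from a checkpoint)
        self.map[dst_arch] = t_old
        self.free.insert(0, new_pr)

    def retire(self, t_old):
        self.free.append(t_old)                   # free the PR only when the next writer retires

rm = RenameMap(n_arch=4, n_phys=8)
t, told = rm.rename(dst_arch=1)                   # e.g. a writer of r1 gets PR4, Told = PR1
print(t, told, rm.map)
```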
Memory ordering: SQ + LQ

Registers are not the only state — memory must be precise too. The store queue holds completed-but-unretired stores; the D-cache only sees retired stores in program order.

  • Conservative (P6): loads wait until every older store address is known, then either forward from SQ (youngest matching older store wins) or read the cache.
  • Opportunistic / speculative: loads execute past ambiguous-address stores. A load queue records load addresses so a later store-address resolve can detect a memory ordering violation and trigger a flush.
  • Hybrid (Store-Sets, Alpha 21264): predict which loads alias which stores; wait conservatively only for those, speculate on the rest.

Empirically <10% of loads forward, so opportunistic + a small predictor wins on average.
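
A conservative (P6-style) forwarding check as a sketch: stall until all older store addresses are known, then let the youngest matching older store win. Sequence numbers, addresses, and the dcache dict are made-up stand-ins.

```python
# Conservative store-to-load forwarding sketch. Store-queue entries are
# (seq, addr, value); addr None means the store address is still unresolved.
def load_lookup(store_queue, load_seq, load_addr, dcache):
    older = [s for s in store_queue if s[0] < load_seq]
    if any(s[1] is None for s in older):
        return "stall"                                   # wait until all older addresses known
    for seq, addr, value in sorted(older, key=lambda s: -s[0]):  # youngest older store first
        if addr == load_addr:
            return value                                 # forward from the SQ
    return dcache.get(load_addr, 0)                      # otherwise read the cache

sq = [(10, 0x100, 7), (12, 0x200, 9), (14, 0x100, 5)]
print(load_lookup(sq, load_seq=15, load_addr=0x100, dcache={0x100: 1}))  # 5 (youngest match)
print(load_lookup(sq, load_seq=13, load_addr=0x100, dcache={0x100: 1}))  # 7 (store 14 is younger)
```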

# Branch Prediction

Front-end predictor structures
  • BTB — PC-indexed cache of (predicted target, is-branch?). Handles direct branches and most indirect calls; partial tags OK because mispredicts are recoverable.
  • BHT — PC-indexed table of direction bits (or 2-bit counters). No tags — aliasing is just another mispredict.
  • 2-bit counter — Saturating 4-state FSM (SN/WN/WT/ST). Predict taken iff high bit = 1; needs two consecutive wrongs to flip polarity (hysteresis kills the inner-loop pathology of a 1-bit BHT).
  • RAS — Tiny circular stack: push PC+4 on call, pop on return. The only structure that handles returns well — BTB is hopeless because each return PC sees many targets.
  • BHR — k-bit shift register of recent T/N outcomes. Global (one shared) captures cross-branch correlation; per-PC (local) captures self-correlation. Must be checkpointed and rolled back on mispredict.
  • gshare — GBHR XOR PC indexes a shared PHT of 2-bit counters. Cheap approximation of per-PC PHTs; basis of the Alpha 21264 hybrid and most modern descendants.
  • hybrid (tournament) — Two predictors run in parallel + a meta-predictor (per-PC 2-bit counter) picks the more accurate one for each branch. McFarling/Alpha 21264 reaches 90–95% direction accuracy.
Mispredict CPI overhead (iron law)
\text{CPI} = \text{CPI}_{\text{ideal}} + f_{\text{br}} \cdot (1 - a) \cdot P

f_br = branch frequency (~0.2 in integer code); a = predictor accuracy; P = pipeline flush penalty in cycles (≈ depth to branch resolution). At f_br = 0.2, a = 0.95, P = 10: ΔCPI = 0.2 · 0.05 · 10 = 0.10 — a 10% penalty on top of CPI_ideal. Multiplicative trap: 0.95³ ≈ 0.857 is the chance that all three branches in one wide fetch group are predicted correctly.
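
The same arithmetic in two lines of Python, using the worked numbers from the text:

```python
# Mispredict CPI overhead: f_br * (1 - accuracy) * flush penalty.
def branch_cpi_overhead(f_br, accuracy, penalty):
    return f_br * (1 - accuracy) * penalty

print(branch_cpi_overhead(0.2, 0.95, 10))   # 0.10 extra CPI
print(0.95 ** 3)                            # ~0.857: whole fetch group of 3 branches all correct
```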

gshare PHT index
\text{idx} = \left(\text{PC}[k{+}1{:}2] \oplus \text{BHR}[k{-}1{:}0]\right) \bmod 2^k

k = log_2 of PHT size = BHR width. XOR (not concat) keeps the index k bits wide while still distinguishing PCs that share recent global history. Drop the 2 low PC bits (instruction-aligned) before XORing.
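
A minimal gshare indexing sketch; the PC, history value, table size, and weakly-not-taken initialization are illustrative assumptions.

```python
# gshare index: XOR global history with the PC (after dropping the 2 alignment bits), keep k bits.
def gshare_index(pc, bhr, k):
    return ((pc >> 2) ^ bhr) & ((1 << k) - 1)

K = 12                                   # 4096-entry PHT of 2-bit counters (assumed size)
pht = [1] * (1 << K)                     # init each counter to weakly not-taken (01)
idx = gshare_index(pc=0x4004A8, bhr=0b101101_011010, k=K)
predict_taken = pht[idx] >= 2            # predict taken iff the counter's high bit is set
print(idx, predict_taken)
```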

[[concept:two-bit-counter|2-bit saturating counter]] FSM over the four states SN (00), WN (01), WT (10), ST (11); T outcomes move the counter up, N outcomes move it down, saturating at the ends. Predict taken iff state ∈ {WT, ST} (high bit = 1). One wrong outcome only nudges the counter; a second consecutive wrong is needed to flip the prediction.
Local vs global history

Local (per-PC) BHR captures self-correlation — perfect for tight inner loops whose outcome depends only on their own recent iterations (e.g. the for(j=0;j<3;j++) example: 3-bit local history reaches 100% steady-state accuracy after warming up the active patterns). Global BHR captures cross-branch correlation — needed when one branch’s outcome implies another’s (if(aa==2) aa=0; if(bb==2) bb=0; if(aa!=bb)). gshare gets most of both at GAg cost. Yeh-Patt’s {G,P}A{g,p,s} taxonomy: first letter = BHR partitioning, last letter = PHT partitioning.

Mispredict recovery & overriding predictors

Big accurate predictors (TAGE, hybrid) take 2–3 cycles, but fetch needs an answer every cycle. Solution: an overriding predictor uses a single-cycle bimodal/BTB for the initial prediction and squashes wrong-path insns when the slow predictor disagrees. Fast recovery itself uses a branch stack (snapshot of map table, free list, ROB/LSQ tails, BHR) plus a per-instruction b-mask — on a mispredict, every RS/FU entry whose mask bit is set for the offending branch is flash-squashed in one cycle, not by walking the ROB. Branches can resolve out-of-order; nested mispredicts work because each branch owns its own mask bit.
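
A toy sketch of the b-mask mechanism described above; entry tags, mask widths, and the flat list of in-flight entries are illustrative assumptions rather than a real datapath.

```python
# b-mask recovery sketch: every in-flight RS/FU entry carries one bit per unresolved
# branch it was fetched under. On a mispredict of branch b, any entry with bit b set
# is squashed in one pass (a parallel mask test in hardware, no ROB walk).
def squash(entries, branch_bit):
    return [(tag, bmask) for tag, bmask in entries if not bmask & (1 << branch_bit)]

def resolve_correct(entries, branch_bit):
    # branch was predicted correctly: just clear its bit so the mask slot can be reused
    return [(tag, bmask & ~(1 << branch_bit)) for tag, bmask in entries]

inflight = [("i5", 0b01), ("i6", 0b01), ("i7", 0b11)]   # i7 depends on branches 0 and 1
print(squash(inflight, branch_bit=1))                    # drops i7 only
print(resolve_correct(inflight, branch_bit=0))           # clears bit 0 everywhere
```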

# Memory Hierarchy

Cache fundamentals
  • cache block — unit of transfer between levels (typ. 64 B). Address splits into tag | index | block-offset; bigger blocks help spatial locality but raise conflict/capacity misses.
  • associativity — ways per set. 1-way = direct-mapped, fully-associative = 1 set. Targets conflict misses; returns saturate near 4–8-way.
  • LRU — evicts oldest-touched way. Cheap for 2-way; NMRU/random/PLRU for higher associativity.
  • write-back vs write-through — WB: dirty bit, only update next level on eviction (low bandwidth). WT: every store propagates (simple coherence).
  • victim cache — small fully-associative buffer next to a low-associativity L1. Catches conflict-evicted lines, even 4–8 entries help.
  • non-blocking cache + MSHRs — service hits/new misses while earlier misses pend. # MSHRs upper-bounds memory-level parallelism.
  • critical-word first — deliver requested word as soon as it arrives, refill rest in background — cuts effective miss penalty.
  • inclusion — L2 ⊇ L1 so external snoops only check L2; requires back-invalidation on L2 evictions.
Virtual memory & prefetching
  • virtual memory — per-process private address space + demand paging, both built on address translation checked on every reference.
  • TLB — cache of recent VPN→PPN translations + permission bits. Hit ≤ 1 pclk; miss triggers page-table walk.
  • multi-level page table — VPN split per level so unused subtrees cost nothing. x86-64: 4 or 5 levels of 9-bit fields, 8-byte PTEs, 4 KB pages.
  • VIPT vs PIPT — VIPT: index from page-offset bits → TLB and cache fire in parallel; max safe size = page × associativity. PIPT: TLB serial — slower hit but no aliasing.
  • page fault — translation missing or page swapped to disk; OS handles in software, costs ~10⁷ cycles.
  • prefetching — speculative fetch ahead of demand. Targets compulsory/capacity/coherence misses; never conflict.
AMAT (Average Memory Access Time)
\text{AMAT} = t_{\text{hit}} + m \cdot t_{\text{miss}}

m = miss rate (in [0,1]); t_miss is the miss penalty. Multi-level: replace t_miss with the next level’s AMAT, e.g. AMAT = t_L1 + m_L1 · (t_L2 + m_L2 · t_mem). Beware local vs. global miss rate: local = misses_Li / accesses_Li; global = misses_Li / total CPU refs.
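
The two-level form with local miss rates, on assumed latencies (1-cycle L1, 10-cycle L2, 100-cycle memory):

```python
# Two-level AMAT with *local* miss rates, as in the formula above.
def amat(t_l1, m_l1, t_l2, m_l2, t_mem):
    return t_l1 + m_l1 * (t_l2 + m_l2 * t_mem)

print(amat(t_l1=1, m_l1=0.05, t_l2=10, m_l2=0.2, t_mem=100))  # 1 + 0.05*(10 + 20) = 2.5 cycles
# The corresponding *global* L2 miss rate would be m_l1 * m_l2 = 0.01 of all CPU references.
```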

Cache geometry & address slicing
C = a \cdot b \cdot S \;\Longrightarrow\; S = \frac{C}{a \cdot b}, \quad |\text{offset}| = \log_2 b, \quad |\text{index}| = \log_2 S, \quad |\text{tag}| = |\text{addr}| - |\text{index}| - |\text{offset}|

C = capacity (bytes), a = associativity (ways), b = block size (bytes), S = #sets. Direct-mapped: a=1. Fully-associative: S=1, no index. Example: 32 KB, 8-way, 64 B → S = 32768/(8·64) = 64 sets → 6 index bits, 6 offset bits.
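
The address-slicing arithmetic as a small helper; the example address is arbitrary, and power-of-two geometry is assumed.

```python
# Slice an address into tag | index | offset for a C-byte, a-way, b-byte-block cache.
def slice_addr(addr, C, a, b):
    sets = C // (a * b)
    off_bits = b.bit_length() - 1        # log2(b), assuming a power-of-two block size
    idx_bits = sets.bit_length() - 1     # log2(#sets)
    offset = addr & (b - 1)
    index  = (addr >> off_bits) & (sets - 1)
    tag    = addr >> (off_bits + idx_bits)
    return tag, index, offset

# 32 KB, 8-way, 64 B blocks -> 64 sets, 6 index bits, 6 offset bits
print(slice_addr(0x0040_72C4, C=32 * 1024, a=8, b=64))
```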

Page-table & translation arithmetic
|\text{VPN}| = |\text{VA}| - \log_2(\text{page size}), \quad \text{flat PT size} = 2^{|\text{VPN}|} \cdot |\text{PTE}|, \quad t_{\text{walk}} \approx L \cdot t_{\text{mem}}

Page offset = log2(page size); same in VA and PA. Flat single-level PT for 64-bit VA with 4 KB pages and 8 B PTEs is 2^{52}·8 ≈ 32 PB — why we use multi-level or inverted tables. L = #levels of page table; each level is a memory access on a TLB miss (e.g. x86-64: 4 levels ≈ 4 dependent loads).
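
The same bookkeeping in Python; the 100-cycle memory latency used for the walk estimate is an assumption.

```python
# Translation bookkeeping: VPN width, flat page-table size, and walk cost on a TLB miss.
import math

def vpn_bits(va_bits, page_size):
    return va_bits - int(math.log2(page_size))

def flat_pt_bytes(va_bits, page_size, pte_bytes):
    return (1 << vpn_bits(va_bits, page_size)) * pte_bytes

print(vpn_bits(64, 4096))                      # 52 VPN bits
print(flat_pt_bytes(64, 4096, 8) / 2**50)      # 32.0 PiB -- why flat tables are hopeless
print(4 * 100)                                 # x86-64: ~4 dependent ~100-cycle accesses per walk
```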

a-way set-associative [[concept:cache-block|cache]] lookup. The address splits into tag | index | block-offset; the index (log₂ #sets bits, #sets = C / (a · b)) selects one set across all ways; a parallel tag comparators check each way’s stored tag against the incoming tag (hit = OR of tag-match AND valid); the way-select MUX delivers the matching way’s data, narrowed to the requested word by the block-offset.

[[concept:address-translation|Address-translation]] flow. Fast path: [[concept:tlb|TLB]] hit + permission check ≈ 1 cycle, then the physical address goes to the cache. TLB miss → hardware/SW walk of the [[concept:page-table|page table]] (one memory access per level, ~100s of cycles), then the TLB is refilled. A PTE that is absent or swapped out escalates to a true [[concept:page-fault|page fault]] handled by the OS at ~10⁷ pclk; a failed permission check raises a protection fault.
4 Cs of misses + write-policy pairings

Mark Hill’s miss taxonomy — each C is defined by the experiment that eliminates it:
  • compulsory — cold first-touch; survives even an infinite cache. Attack with bigger blocks or prefetching.
  • capacity — working set > cache; survives in a fully-associative cache of the same size. Attack with a bigger cache or better replacement.
  • conflict — the residual after capacity is removed; only in set-associative or direct-mapped caches. Attack with higher associativity, a victim cache, or skewed associativity.
  • coherence — another core invalidated the line; only shows up in multiprocessors.
Standard write-policy pairings: write-back + write-allocate (modern L1/L2 — coalesces repeated stores, hit-rate-friendly); write-through + no-write-allocate (simple coherence, low complexity, but bus bandwidth scales with the store fraction f_w regardless of miss rate).

Prefetcher classes

Hardware prefetcher families, in increasing pattern complexity:
  • next-line — on a miss for line X also fetch X+1; trivial, dominates I-cache, weak on D-cache.
  • stride — per-PC Reference Prediction Table records last address + last stride; on a confirmed stride match prefetch addr + s, addr + 2s, …; captures non-unit/negative strides through arrays.
  • stream buffer (Jouppi) — FIFO of sequential lines beyond a detected stream, sitting next to the cache so wrong streams cannot pollute L1; lookup checks the FIFO heads.
  • address-correlating / delta-correlation — Markov-style table or GHB-walked linked list keyed by recent address(es) or deltas, predicting the next address(es) from history; covers irregular-but-repetitive sequences.
  • spatial (SMS) — record a per-region offset bitmap keyed by the triggering load PC, replay it on the next region.
  • linked-data (Roth) — learn load-to-load value dependences (p = p->next) and run an FSM ahead through the structure.
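
A minimal stride-prefetcher sketch along the lines of the Reference Prediction Table above; the table fields, confirmation rule, and prefetch degree are simplified assumptions.

```python
# Stride-prefetcher sketch: a per-PC table remembers the last address and last stride;
# once the same non-zero stride repeats, prefetch a few blocks ahead.
class StridePrefetcher:
    def __init__(self, degree=2):
        self.rpt = {}            # pc -> (last_addr, last_stride, confirmed)
        self.degree = degree     # how many strides ahead to fetch

    def access(self, pc, addr):
        prefetches = []
        if pc in self.rpt:
            last_addr, last_stride, confirmed = self.rpt[pc]
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                prefetches = [addr + i * stride for i in range(1, self.degree + 1)]
                confirmed = True
            self.rpt[pc] = (addr, stride, confirmed)
        else:
            self.rpt[pc] = (addr, 0, False)
        return prefetches

pf = StridePrefetcher()
for a in (0x1000, 0x1040, 0x1080, 0x10C0):   # one load PC walking an array, stride 64
    print(pf.access(pc=0x400, addr=a))        # from the third access on: predicts +64, +128
```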

# Multiprocessors

Coherence states & protocols
  • MSI — three states {Modified, Shared, Invalid}; M = exclusive & dirty (memory stale), S = clean read-only (memory current), I = no copy.
  • MESI — MSI + Exclusive: clean line held by exactly one cache; lets a read-then-write avoid the BusInv since no other sharer exists.
  • snooping — shared bus is the serialization point; every cache snoops every transaction. Simple; bus saturates beyond ~16 cores.
  • directory — home-node directory tracks sharers per line; point-to-point invalidates only to caches that actually share. Scales to many-core; longer hit latency.
  • write-back — the Modified state’s whole point — repeated stores to a private line are local hits; BusWB only on eviction.
  • write-through — every store hits the bus (e.g. trivial valid/invalid protocol). Simple but bandwidth-hostile.

MSI state machine, flattened into its transitions (processor action or snooped bus event / bus message; a dash means silent, no bus traffic):

| From | Event / bus message | To |
| --- | --- | --- |
| I | Load / BusRd | S |
| I | Store / BusRdX | M |
| S | Store / BusInv | M |
| S | Evict / — | I |
| S | snoop BusRd / [BusReply] | S |
| S | snoop BusRdX or BusInv / [BusReply] | I |
| M | Load, Store / — | M |
| M | Evict / BusWB | I |
| M | snoop BusRd / BusReply (memory snarfs) | S |
| M | snoop BusRdX / BusReply | I |

MESI adds an Exclusive state next to Shared so a clean-but-only sharer can upgrade to Modified without a BusInv. Memorize: Shared implies memory is clean; Modified implies memory is stale; M → S forces memory to snarf the data on the BusReply.
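
A tiny per-line encoding of those transitions as a lookup table, useful for tracing a sharing pattern by hand; event names and the bracketed optional-reply notation are taken from the table, everything else is an illustrative sketch.

```python
# MSI per-line state machine sketch: (state, event) -> (next_state, bus_message).
MSI = {
    ("I", "Load"):        ("S", "BusRd"),
    ("I", "Store"):       ("M", "BusRdX"),
    ("S", "Load"):        ("S", None),          # read hit in Shared is silent
    ("S", "Store"):       ("M", "BusInv"),
    ("S", "Evict"):       ("I", None),
    ("S", "SnoopBusRdX"): ("I", "[BusReply]"),  # reply only if the protocol asks sharers to
    ("M", "Load"):        ("M", None),
    ("M", "Store"):       ("M", None),
    ("M", "Evict"):       ("I", "BusWB"),
    ("M", "SnoopBusRd"):  ("S", "BusReply"),    # memory snarfs the data on the reply
    ("M", "SnoopBusRdX"): ("I", "BusReply"),
}

state = "I"
for ev in ("Load", "Store", "SnoopBusRd"):       # read, then write, then another core reads
    state, msg = MSI[(state, ev)]
    print(state, msg)                            # S BusRd -> M BusInv -> S BusReply
```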
Synchronization primitives
  • lock — mutual exclusion; built atop hardware atomics (test-and-set, CAS, LL/SC). Spin under low contention, block under high — hybrid for variable workloads.
  • barrier — all N threads must arrive before any proceeds; foundational to BSP / SPMD phases. Cost grows with N; centralized counters cause cache-line ping-pong.
  • transactional memory — optimistic concurrency: speculate through the critical section and roll back on conflict. Wins when contention is rare and critical sections are short.
  • shared memory — single global address space; communication is implicit via loads/stores; needs hardware coherence + a memory model.
  • message passing — private address spaces with explicit send/recv (MPI). No coherence needed; scales to clusters.
Coherence vs. consistency (the high-leverage distinction)

Cache coherence is the contract for one address: every cache eventually agrees on its value, and writes to that address are seen in some single global order. The memory consistency model is the contract across addresses: the order in which one thread’s loads/stores to different locations become visible to other threads. Sequential consistency (Lamport) demands one global interleaving of all ops in program order. Real hardware is weaker — x86 TSO lets a store sit in the store buffer past a younger load; ARM/RISC-V use release consistency where ordering is only guaranteed across explicit fences/acquire-release. Coherence is automatic; consistency requires fences and atomics from the programmer.

Snooping vs. directory

Snooping hits two scaling walls: bus bandwidth and per-cache snoop bandwidth (most snoops are ‘I don’t have it’). Directory protocols fix both — replace the bus with a point-to-point on-chip network (mesh/ring) and the broadcast with a per-line sharer list at the home node. Cost: an extra network round-trip on every miss (requestor -> home -> sharers -> requestor) and storage for the sharer vector. Used in any modern 16+ core CMP, NUMA SMPs, and SGI-style machines. Snooping still wins for small core counts because of its simplicity and lack of indirection.

False sharing & write-invalidate vs. write-update

MESI is a write-invalidate protocol: a writer sends BusRdX/BusInv to invalidate every other sharer, then writes locally in Modified. Write-update (e.g. Dragon, Firefly) instead broadcasts the new value so other sharers stay valid — better when many readers consume each write, worse for migratory data. Real hardware almost universally chose invalidate. False sharing is the pathology you must recognize: two threads write different fields that happen to live in the same cache block. The block ping-pongs M -> I -> M between caches on every store even though the threads share no data. Fix by padding hot fields to a cache-line boundary.

# Exam tips

Renaming kills WAR and WAW — never RAW

Every exam in the corpus has at least one question that hinges on this: register renaming (and therefore Tomasulo / P6 / R10K) eliminates only WAW and WAR name dependences. RAW is a true data-flow edge — the consumer must wait for the producer’s value, full stop. When you compute ILP under Tomasulo, drop only WAW/WAR edges from the dependence graph; the longest remaining RAW chain sets the bound.

2-bit predictor on a periodic pattern is bounded by run lengths

On the T,T,T,T,T,N,N pattern in the Spring 2024 final, the 2-bit saturating counter reaches weakly not taken by the end of the period and mispredicts the first T of the next period — accuracy is 4/7, not 5/7. Whenever you analyze a periodic pattern, walk one full period and check the predictor’s state at the boundary; the misprediction at the wrap-around is the one students forget.
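
A short simulation of that walk, reproducing the 4/7 steady-state figure; the counter encoding (0–3, predict taken iff ≥ 2) and the initial state are assumptions consistent with the FSM above.

```python
# Walk the T,T,T,T,T,N,N pattern through a 2-bit saturating counter and measure accuracy.
def run(pattern, periods=100, counter=3):
    correct = 0
    for outcome in pattern * periods:
        predict_taken = counter >= 2                  # predict taken iff high bit set
        correct += (predict_taken == (outcome == "T"))
        counter = min(3, counter + 1) if outcome == "T" else max(0, counter - 1)
    return correct / (len(pattern) * periods)

print(run(["T"] * 5 + ["N"] * 2))    # ~0.57 -- i.e. 4/7 per period in steady state, not 5/7
```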