CS184c: Computer Architecture [Parallel and Multithreaded]
Day 7: April 24, 2001
Threaded Abstract Machine (TAM)
Simultaneous Multi-Threading (SMT)

CALTECH cs184c Spring2001 -- DeHon

Reading
• Shared Memory
  – Focus: H&P Ch 8
    • At least read this…
  – Retrospectives
    • Valuable and short
  – ISCA papers
    • Good primary sources

Today
• TAM
• SMT

Threaded Abstract Machine

TAM
• Parallel Assembly Language
• Fine-Grained Threading
• Hybrid Dataflow
• Scheduling Hierarchy

TL0 Model
• Activation Frame (like stack frame)
  – Variables
  – Synchronization
  – Thread stack (continuation vectors)
• Heap Storage
  – I-structures

TL0 Ops
• RISC-like ALU Ops
• FORK
• SWITCH
• STOP
• POST
• FALLOC
• FFREE
• SWAP

Scheduling Hierarchy
• Intra-frame
  – Related threads in same frame
  – Frame runs on single processor
  – Schedule together, exploit locality
    • (cache, maybe regs)
• Inter-frame
  – Only swap when exhaust work in current frame

Intra-Frame Scheduling
• Simple (local) stack of pending threads
• FORK places new PC on stack
• STOP pops next PC off stack
• Stack initialized with code to exit activation frame
  – Including schedule next frame
  – Save live registers

TL0/CM5 Intra-frame
• Fork on thread
  – Fall through: 0 inst
  – Unsynch branch: 3 inst
  – Successful synch: 4 inst
  – Unsuccessful synch: 8 inst
• Push thread onto LCV: 3-6 inst

Fib Example
• [look at how this turns into TL0 code]

Multiprocessor Parallelism
• Comes from frame allocations
• Runtime policy where allocate frames
  – Maybe use work stealing?
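
The intra-frame scheduling slide (a local stack of pending threads, FORK pushes, STOP pops, exit code at the bottom) can be mimicked in a few lines. This is only a minimal sketch, not TL0 itself: the names `Frame`, `fork`, `run`, `t_a`, `t_b`, `t_c`, and `exit_frame` are hypothetical, threads are modeled as Python functions, and a function returning stands in for STOP.

```python
class Frame:
    """Hypothetical activation frame with a local continuation vector (LCV)."""

    def __init__(self, exit_thread):
        # The stack starts with the code that exits the frame, so it runs
        # only after every other pending thread has been popped.
        self.lcv = [exit_thread]
        self.trace = []

    def fork(self, thread):
        """FORK: place the new thread's entry point on the stack."""
        self.lcv.append(thread)

    def run(self):
        """Run pending threads; each thread returning plays the role of STOP."""
        while self.lcv:
            thread = self.lcv.pop()  # STOP pops the next PC off the stack
            thread(self)


def exit_frame(frame):
    frame.trace.append("exit")  # stand-in for "schedule next frame"


def t_a(frame):
    frame.trace.append("a")
    frame.fork(t_c)  # a running thread may fork further threads


def t_b(frame):
    frame.trace.append("b")


def t_c(frame):
    frame.trace.append("c")


frame = Frame(exit_frame)
frame.fork(t_a)
frame.fork(t_b)
frame.run()
print(frame.trace)  # ['b', 'a', 'c', 'exit']
```

Because the pending set is a stack, the most recently forked thread runs first, and the exit code runs only once the frame has no other work left.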

Frame Scheduling
• Inlets to non-active frames initiate pending thread stack (RCV)
• First inlet may place frame on processor's runnable frame queue
• SWAP instruction picks next frame, branches to its enter thread

CM5 Frame Scheduling Costs
• Inlet posts on non-running thread
  – 10-15 instructions
• Swap to next frame
  – 14 instructions
• Average thread cost 7 cycles
  – Constitutes 15-30% of TL0 instr

Instruction Mix
[figure]
[Culler et al., JPDC, July 1993]

Cycle Breakdown
[figure]
[Culler et al., JPDC, July 1993]

Speedup Example
[figure]
[Culler et al., JPDC, July 1993]

Thread Stats
• Thread lengths: 3 to 17
• Threads run per "quantum": 7 to 530
[Culler et al., JPDC, July 1993]
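
The frame scheduling slide describes inlets posting threads to a non-active frame's RCV, the first post placing the frame on the processor's runnable queue, and SWAP picking the next frame. A minimal sketch of that flow follows, under stated assumptions: the class and method names (`Frame`, `Processor`, `inlet_post`, `swap`) are hypothetical, threads are plain strings, and "running" a thread just logs it.

```python
from collections import deque


class Frame:
    """Hypothetical non-active frame with a remote continuation vector."""

    def __init__(self, name):
        self.name = name
        self.rcv = []        # threads posted by inlets while frame is not running
        self.queued = False  # is the frame already on the runnable queue?


class Processor:
    def __init__(self):
        self.ready = deque()  # runnable frame queue
        self.log = []

    def inlet_post(self, frame, thread_name):
        """Inlet posts a thread; only the first post enqueues the frame."""
        frame.rcv.append(thread_name)
        if not frame.queued:
            frame.queued = True
            self.ready.append(frame)

    def swap(self):
        """SWAP: pick the next frame and run all of its pending threads."""
        if not self.ready:
            return False
        frame = self.ready.popleft()
        frame.queued = False
        lcv, frame.rcv = frame.rcv, []  # pending threads become the local stack
        while lcv:
            self.log.append((frame.name, lcv.pop()))
        return True


cpu = Processor()
f1, f2 = Frame("f1"), Frame("f2")
cpu.inlet_post(f1, "t1")
cpu.inlet_post(f2, "t2")
cpu.inlet_post(f1, "t3")  # f1 is already queued, so it is not enqueued again
while cpu.swap():
    pass
print(cpu.log)  # [('f1', 't3'), ('f1', 't1'), ('f2', 't2')]
```

Both threads posted to `f1` run in one visit to that frame, which is the point of the hierarchy: the swap cost is paid once per frame visit rather than once per thread.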

Great Project
• Develop optimized µArch for TAM
  – Hardware support/architecture for single-cycle thread-switch/post

Multithreaded Architectures

Problem
• Long latency of operations
  – Non-local memory fetch
  – Long latency operations (mpy, fp)
• Wastes processor cycles while stalled
• If processor stalls on return
  – Latency problem turns into a throughput (utilization) problem
  – CPU sits idle

Idea
• Run something else useful while stalled
• In particular, another thread
  – Another PC
• Again, use parallelism to "tolerate" latency

HEP/µUnity/Tera
• Provide a number of contexts
  – Copies of register file…
• Number of contexts ≥ operation latency
  – Pipeline depth
  – Roundtrip time to main memory
• Run each round-robin

HEP Pipeline
[figure: Arvind+Iannucci, DFVLR'87]
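
The "number of contexts ≥ operation latency" condition on the HEP/µUnity/Tera slide can be checked with a toy model. The sketch below is an illustration under simplifying assumptions, not a model of any of those machines: `utilization` is a hypothetical helper, every instruction is assumed to take the same `latency` cycles before its context can issue again, and a context that is not ready simply wastes its turn.

```python
def utilization(num_contexts, latency, cycles=1000):
    """Fraction of cycles that issue an instruction under strict round-robin.

    Assumption: each issued instruction blocks its own context for
    `latency` cycles; other contexts are unaffected.
    """
    ready_at = [0] * num_contexts
    issued = 0
    for cycle in range(cycles):
        ctx = cycle % num_contexts  # strict round-robin selection
        if ready_at[ctx] <= cycle:
            issued += 1
            ready_at[ctx] = cycle + latency
        # otherwise the slot is wasted: this context is still waiting
    return issued / cycles


print(utilization(num_contexts=1, latency=8))  # 0.125
print(utilization(num_contexts=4, latency=8))  # 0.5
print(utilization(num_contexts=8, latency=8))  # 1.0
```

With eight contexts and a latency of eight cycles, each context's previous instruction has just completed when its turn comes around again, so no issue slot is wasted. A single thread on the same model still issues only once every eight cycles.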

Strict Interleaved Threading
• Uses parallelism to get throughput
• Potentially poor single-threaded performance
  – Increases end-to-end latency of thread

SMT

Can we do both?
• Issue from multiple threads into pipeline
• No worse than (super)scalar on single thread
• More throughput with multiple threads
  – Fill in what would have been empty issue slots with instructions from different threads

SuperScalar Inefficiency
[figure: unused issue slots]
Recall: limited scalar IPC

SMT Promise
Fill in empty slots with other threads

SMT Estimates (ideal)
[figure]
[Tullsen et al., ISCA '95]
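
The "fill in empty issue slots" idea can be illustrated with a one-cycle toy calculation. This is a sketch under assumptions, not a pipeline model: `issue_slots_used` is a hypothetical helper, and each thread is summarized by how many independent instructions it has ready in the cycle.

```python
def issue_slots_used(ready_per_thread, width):
    """Issue slots filled in one cycle when threads share a `width`-wide issue stage.

    `ready_per_thread` lists, per thread, how many instructions are ready
    to issue this cycle (a hypothetical summary of each thread's ILP).
    """
    used = 0
    for ready in ready_per_thread:
        used += min(ready, width - used)  # take what fits in the remaining slots
    return used


# One thread with limited ILP leaves most of an 8-wide machine idle.
print(issue_slots_used([2], width=8))           # 2
# Several such threads together can fill the same machine.
print(issue_slots_used([2, 1, 3, 2], width=8))  # 8
```

A single thread is treated exactly as it would be on the superscalar, so single-thread performance is unchanged in this model, while additional threads only consume slots that would otherwise go unused.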

SMT Estimates (ideal)
[figure]
[Tullsen et al., ISCA '95]

SMT uArch
• Observation: exploit register renaming
  – Get small modifications to existing superscalar architecture

SMT uArch
• N.B. remarkable thing is how similar superscalar core is
[figure]
[Tullsen et al., ISCA '96]

Stopped Here 4/24/01

SMT uArch
• Changes:
  – Multiple PCs
  – Control to decide how to fetch from
  – Separate return stacks per thread
  – Per-thread reorder/commit/flush/trap
  – Thread id w/ BTB
  – Larger register file
    • More things outstanding

Performance
[figure]
[Tullsen et al., ISCA '96]

Optimizing: fetch freedom
• RR = Round Robin
• RR.X.Y
  – X: threads that fetch in a cycle
  – Y: instructions fetched per thread
[Tullsen et al., ISCA '96]

Optimizing: Fetch Alg.
• ICOUNT: priority to thread w/ fewest pending instrs
• BRCOUNT
• MISSCOUNT
• IQPOSN: penalize threads w/ old instrs (at front of queues)
[Tullsen et al., ISCA '96]

Throughput Improvement
• 8-issue superscalar
  – Achieves a little over 2 instructions per cycle
• Optimized SMT
  – Achieves 5.4 instructions per cycle on 8 threads
• 2.5x throughput increase

Costs
[figure]
[Burns+Gaudiot HPCA'99]

Costs
[figure]
[Burns+Gaudiot HPCA'99]

Not Done, yet…
• Conventional SMT formulation is for coarse-grained threads
• Combine SMT w/ TAM?
  – Fill pipeline from multiple runnable threads in activation frame
  – Multiple activation frames?
  – Eliminate thread switch overhead?
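
The fetch policies above differ only in how they choose which threads fetch in a given cycle. The sketch below contrasts round-robin selection with ICOUNT-style selection. It is a hedged illustration: `rr_pick`, `icount_pick`, and the `pending` dictionary are hypothetical names, and the per-thread counts are made-up bookkeeping rather than figures from the cited paper.

```python
def rr_pick(cycle, num_threads, threads_per_cycle):
    """Round robin: rotate the starting thread each cycle (the X in RR.X.Y)."""
    start = cycle % num_threads
    return [(start + i) % num_threads for i in range(threads_per_cycle)]


def icount_pick(pending, threads_per_cycle):
    """ICOUNT-style: prefer threads with the fewest pending instructions.

    `pending` maps thread id -> number of instructions still waiting in the
    machine (hypothetical counters); ties are broken by lower thread id.
    """
    order = sorted(pending, key=lambda tid: (pending[tid], tid))
    return order[:threads_per_cycle]


pending = {0: 12, 1: 3, 2: 7, 3: 3}
print(rr_pick(cycle=5, num_threads=4, threads_per_cycle=2))  # [1, 2]
print(icount_pick(pending, threads_per_cycle=2))             # [1, 3]
```

Round robin ignores machine state, so it may fetch for thread 2 even though that thread already has a backlog. The ICOUNT-style choice steers fetch toward threads 1 and 3, which have the fewest instructions waiting.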

Thought?
• SMT reduce need for split-phase operations?

Big Ideas
• Primitives
  – Parallel Assembly Language
  – Threads for control
  – Synchronization (post, full-empty)
• Latency Hiding
  – Threads, split-phase operation
• Exploit Locality
  – Create locality
    • Scheduling quanta