CS184c: Computer Architecture [Parallel and Multithreaded]
Day 7: April 24, 2001
Threaded Abstract Machine (TAM)
Simultaneous Multi-Threading (SMT)
CALTECH cs184c Spring2001 -- DeHon

Reading
• Shared Memory
  – Focus: H&P Ch 8
    • At least read this…
  – Retrospectives
    • Valuable and short
  – ISCA papers
    • Good primary sources

Today
• TAM
• SMT

Threaded Abstract Machine

TAM
• Parallel Assembly Language
• Fine-Grained Threading
• Hybrid Dataflow
• Scheduling Hierarchy

TL0 Model
• Activation Frame (like stack frame)
  – Variables
  – Synchronization
  – Thread stack (continuation vectors)
• Heap Storage
  – I-structures

TL0 Ops
• RISC-like ALU Ops
• FORK
• SWITCH
• STOP
• POST
• FALLOC
• FFREE
• SWAP

Scheduling Hierarchy
• Intra-frame
  – Related threads in same frame
  – Frame runs on single processor
  – Schedule together, exploit locality
    • (cache, maybe regs)
• Inter-frame
  – Only swap when work in current frame is exhausted

Intra-Frame Scheduling
• Simple (local) stack of pending threads
• FORK places new PC on stack
• STOP pops next PC off stack
• Stack initialized with code to exit activation frame
  – Including schedule next frame
  – Save live registers

TL0/CM5 Intra-frame
• Fork on thread
  – Fall through: 0 inst
  – Unsynch branch: 3 inst
  – Successful synch: 4 inst
  – Unsuccessful synch: 8 inst
• Push thread onto LCV: 3-6 inst

Fib Example
• [look at how this turns into TL0 code]

Multiprocessor Parallelism
• Comes from frame allocations
• Runtime policy decides where to allocate frames
  – Maybe use work stealing?
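The intra-frame scheduling slide above describes a frame-local stack of pending threads, where FORK pushes a PC and STOP pops the next one. A minimal sketch of that discipline follows. The `Frame` class, its method names, and the `t_*` threads are hypothetical illustrations, not TL0 syntax, and Python functions stand in for thread entry points.

```python
class Frame:
    """Hypothetical activation frame with a local continuation vector (LCV)."""

    def __init__(self, name):
        self.name = name
        self.lcv = []    # stack of pending thread entry points ("PCs")
        self.trace = []  # names of threads in the order they ran

    def fork(self, thread):
        # FORK: place the new thread's PC on the frame-local stack.
        self.lcv.append(thread)

    def run(self, entry):
        # Returning from a thread plays the role of STOP: pop the next PC.
        # An empty stack stands in for the "exit activation frame" code
        # that the slides say the stack is initialized with.
        self.fork(entry)
        while self.lcv:
            thread = self.lcv.pop()
            self.trace.append(thread.__name__)
            thread(self)
        return self.trace


def t_main(frame):
    frame.fork(t_a)
    frame.fork(t_b)


def t_a(frame):
    pass


def t_b(frame):
    frame.fork(t_c)


def t_c(frame):
    pass


trace = Frame("demo").run(t_main)
print(trace)
```

Because the pending threads are kept on a stack, the most recently forked thread runs first, which keeps related work in the same frame running back to back.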

Frame Scheduling
• Inlets to non-active frames initiate pending thread stack (RCV)
• First inlet may place frame on processor's runnable frame queue
• SWAP instruction picks next frame, branches to its enter thread

CM5 Frame Scheduling Costs
• Inlet posts on non-running thread
  – 10-15 instructions
• Swap to next frame
  – 14 instructions
• Average thread cost 7 cycles
  – Constitutes 15-30% of TL0 instr

Instruction Mix
[figure: instruction mix; Culler et al., JPDC, July 1993]

Cycle Breakdown
[figure: cycle breakdown; Culler et al., JPDC, July 1993]

Speedup Example
[figure: speedup; Culler et al., JPDC, July 1993]

Thread Stats
• Thread lengths 3 to 17
• Threads run per "quantum" 7 to 530
[Culler et al., JPDC, July 1993]
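The frame scheduling slide above can be sketched as a per-processor queue of runnable frames. This is a simplified illustration under stated assumptions: the `Processor` and `Frame` classes and their methods are hypothetical names, posted threads are plain strings, and "running" a frame just drains its RCV into a log.

```python
from collections import deque


class Frame:
    def __init__(self, name):
        self.name = name
        self.rcv = []  # remote continuation vector: threads posted by inlets


class Processor:
    def __init__(self):
        self.runnable = deque()  # queue of frames that have pending threads
        self.log = []            # (frame, thread) pairs in execution order

    def post(self, frame, thread):
        # An inlet posts a thread to a non-active frame. The first post
        # also places the frame on the runnable frame queue.
        if not frame.rcv:
            self.runnable.append(frame)
        frame.rcv.append(thread)

    def swap(self):
        # SWAP: pick the next runnable frame and run its pending threads
        # until the frame has no work left.
        if not self.runnable:
            return None
        frame = self.runnable.popleft()
        while frame.rcv:
            self.log.append((frame.name, frame.rcv.pop()))
        return frame.name


cpu = Processor()
f, g = Frame("f"), Frame("g")
cpu.post(f, "t1")
cpu.post(g, "u1")
cpu.post(f, "t2")
order = [cpu.swap(), cpu.swap(), cpu.swap()]
print(order, cpu.log)
```

Both threads posted to frame `f` run in a single scheduling quantum even though a post to `g` arrived between them, which is the locality the scheduling hierarchy is meant to exploit.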

Great Project
• Develop optimized µArch for TAM
  – Hardware support/architecture for single-cycle thread-switch/post

Multithreaded Architectures

Problem
• Long latency of operations
  – Non-local memory fetch
  – Long latency operations (mpy, fp)
• Wastes processor cycles while stalled
• If processor stalls on return
  – Latency problem turns into a throughput (utilization) problem
  – CPU sits idle

Idea
• Run something else useful while stalled
• In particular, another thread
  – Another PC
• Again, use parallelism to "tolerate" latency

HEP/µUnity/Tera
• Provide a number of contexts
  – Copies of register file…
• Number of contexts ≥ operation latency
  – Pipeline depth
  – Roundtrip time to main memory
• Run each round-robin

HEP Pipeline
[figure: Arvind+Iannucci, DFVLR'87]
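The rule "number of contexts ≥ operation latency" can be checked with a small simulation of strict round-robin issue. This is a sketch under simplifying assumptions that are not from the slides: every instruction has the same latency, and each context must wait for its previous instruction to complete before issuing the next. The function name and parameters are hypothetical.

```python
def round_robin_utilization(contexts, latency, cycles):
    """Fraction of cycles in which an instruction issues.

    Cycle t belongs to context t % contexts (strict round-robin).
    A context may issue only once its previous instruction has finished,
    i.e. `latency` cycles after that instruction issued.
    """
    ready_at = [0] * contexts
    issued = 0
    for t in range(cycles):
        c = t % contexts
        if ready_at[c] <= t:
            issued += 1
            ready_at[c] = t + latency
    return issued / cycles


for n in (1, 4, 8):
    print(n, round_robin_utilization(contexts=n, latency=8, cycles=800))
```

With a latency of 8 cycles, one context keeps the pipeline busy one cycle in eight, four contexts keep it half busy, and eight contexts fill every cycle. Adding contexts beyond the latency cannot raise utilization further in this model.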

Strict Interleaved Threading
• Uses parallelism to get throughput
• Potentially poor single-threaded performance
  – Increases end-to-end latency of thread

SMT

Can we do both?
• Issue from multiple threads into pipeline
• No worse than (super)scalar on single thread
• More throughput with multiple threads
  – Fill in what would have been empty issue slots with instructions from different threads

SuperScalar Inefficiency
[figure: issue slots over time, with unused slots marked]
Recall: limited Scalar IPC

SMT Promise
[figure: fill in empty slots with other threads]

SMT Estimates (ideal)
[figure; Tullsen et al., ISCA '95]
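The slot-filling idea above can be illustrated with a toy issue-slot count. The per-cycle numbers below are made up for illustration and are not measurements. The model is deliberately crude: instructions that do not issue in a cycle are dropped rather than carried over, and `slot_utilization` is a hypothetical helper.

```python
def slot_utilization(ready, width):
    """Fraction of issue slots filled.

    ready[t][c] is the number of instructions thread t could issue in
    cycle c. Each cycle, up to `width` slots are filled from whatever the
    given threads offer in total.
    """
    cycles = len(ready[0])
    issued = 0
    for c in range(cycles):
        wanted = sum(thread[c] for thread in ready)
        issued += min(width, wanted)
    return issued / (width * cycles)


# Illustrative, made-up ready counts for three threads over four cycles.
ready = [
    [2, 0, 1, 3],
    [1, 2, 0, 1],
    [0, 3, 2, 1],
]
single = slot_utilization(ready[:1], width=4)  # superscalar, one thread
smt = slot_utilization(ready, width=4)         # SMT, all three threads
print(single, smt)
```

In this toy input the single thread fills 6 of 16 slots, while the three threads together fill 14 of 16, since a cycle where one thread stalls can still be filled by the others.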

SMT Estimates (ideal)
[figure; Tullsen et al., ISCA '95]

SMT uArch
• Observation: exploit register renaming
  – Get small modifications to existing superscalar architecture

SMT uArch
• N.B. remarkable thing is how similar superscalar core is
[figure; Tullsen et al., ISCA '96]

Stopped Here 4/24/01

SMT uArch
• Changes:
  – Multiple PCs
  – Control to decide which threads to fetch from
  – Separate return stacks per thread
  – Per-thread reorder/commit/flush/trap
  – Thread id w/ BTB
  – Larger register file
    • More things outstanding

Performance
[figure; Tullsen et al., ISCA '96]

Optimizing: fetch freedom
• RR = Round Robin
• RR.X.Y
  – X: threads that fetch in a cycle
  – Y: instructions fetched/thread
[Tullsen et al., ISCA '96]

Optimizing: Fetch Alg.
• ICOUNT – priority to thread w/ fewest pending instrs
• BRCOUNT
• MISSCOUNT
• IQPOSN – penalize threads w/ old instrs (at front of queues)
[Tullsen et al., ISCA '96]

Throughput Improvement
• 8-issue superscalar
  – Achieves little over 2 instructions per cycle
• Optimized SMT
  – Achieves 5.4 instructions per cycle on 8 threads
• 2.5x throughput increase

Costs
[figure; Burns+Gaudiot, HPCA'99]

Costs
[figure; Burns+Gaudiot, HPCA'99]

Not Done, yet…
• Conventional SMT formulation is for coarse-grained threads
• Combine SMT w/ TAM?
  – Fill pipeline from multiple runnable threads in activation frame
  – Multiple activation frames?
  – Eliminate thread switch overhead?
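The two fetch policies named above, round robin and ICOUNT, can be contrasted in a few lines. This is a sketch only: `rr_pick` and `icount_pick` are hypothetical helpers, the rotation starting point and the tie-break by thread id are assumptions, and `pending` is simply a map from thread id to its count of pending instructions.

```python
def rr_pick(num_threads, cycle, x):
    """Round robin: pick x threads, rotating the starting thread each cycle."""
    start = cycle % num_threads
    return [(start + i) % num_threads for i in range(x)]


def icount_pick(pending, x):
    """ICOUNT: pick the x threads with the fewest pending instructions.

    Ties are broken by lower thread id (an assumption for determinism).
    """
    return sorted(pending, key=lambda t: (pending[t], t))[:x]


# Made-up pending-instruction counts for four threads.
pending = {0: 12, 1: 3, 2: 7, 3: 3}
print(rr_pick(num_threads=4, cycle=5, x=2))
print(icount_pick(pending, x=2))
```

Round robin ignores how backed up each thread is, while ICOUNT steers fetch toward threads 1 and 3 here because they have the fewest instructions waiting.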

Thought?
• Does SMT reduce the need for split-phase operations?

Big Ideas
• Primitives
  – Parallel Assembly Language
  – Threads for control
  – Synchronization (post, full-empty)
• Latency Hiding
  – Threads, split-phase operation
• Exploit Locality
  – Create locality
    • Scheduling quanta
