9c.1
EE 457 Unit 9c
Thread Level Parallelism

9c.2 Credits
• Some of the material in this presentation is taken from:
  – Computer Architecture: A Quantitative Approach
    • John Hennessy & David Patterson
• Some of the material in this presentation is derived from course notes and slides from:
  – Prof. Michel Dubois (USC)
  – Prof. Murali Annavaram (USC)
  – Prof. David Patterson (UC Berkeley)

9c.3
A Case for Thread-Level Parallelism
CHIP MULTITHREADING AND MULTIPROCESSORS

9c.4 Motivating HW Multithread/Multicore
• Issues that prevent us from exploiting ILP in more advanced single-core processors with deeper pipelines and OoO execution:
  – ______________ hierarchy
  – Increased ___________ with _______ clock rates
  – Increased ___________ with more advanced structures (ROBs, issue queues, etc.)
9c.5 Memory Wall Problem
• Processor performance is increasing much faster than memory performance
  – Processor: 55%/year vs. memory latency: 7%/year, producing a growing Processor-Memory Performance Gap
    (Hennessy and Patterson, Computer Architecture – A Quantitative Approach, 2003)
• There is a limit to ILP!
  – If a cache miss requires several hundred clock cycles, even OoO pipelines with 10's or 100's of in-flight instructions may stall.

9c.6 The Problem with the 5-Stage Pipeline
• A cache miss (memory-induced stall) causes computation to stall
• A __________ in compute time yields only minimal overall speedup due to _________________ dominating compute
  [Diagram: single-thread execution alternates compute time (C) and memory latency (M): C M C M C M. Even with a speedup in compute, the timeline remains C M C M C M; actual program speedup is minimal due to _________ (w/ ________ in compute).]
  Adapted from: OpenSparc T1 Micro-architecture Specification

9c.7 Cache Hierarchy
• A hierarchy of caches can help mitigate the cache miss penalty
• L1 Cache
  – 64 KB
  – 2 cycle access time
  – Common miss rate ~ ___
• L2 Cache
  – 1 MB
  – 20 cycle access time
  – Common miss rate ~ ___
• Main Memory
  – _____ cycle access time
  [Diagram: P -> L1 Cache -> L2 Cache -> L3 Cache -> Memory]

9c.8 Cache Penalty Example
• Assume an L1 hit rate of 95% and a miss penalty of 20 clock cycles (assuming these misses hit in L2). What is the CPI for our typical 5-stage pipeline?
  – 95 instructions take ____ cycles to execute
  – 5 instructions take _________ cycles to execute
  – Total _____ cycles for 100 instructions = CPI of ____
  – Effective CPI = Ideal CPI + Miss Rate * Miss Penalty Cycles
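A quick way to check the arithmetic is to plug the slide's numbers into the stated formula. A minimal sketch in C; the ideal CPI of 1 for the base 5-stage pipeline is an assumption, and the variable names are mine, not from the slides.

```c
#include <stdio.h>

int main(void) {
    /* Parameters from slide 9c.8 */
    double ideal_cpi    = 1.0;   /* assumed: 1 cycle/instruction when every access hits */
    double miss_rate    = 0.05;  /* 95% L1 hit rate => 5% miss rate */
    double miss_penalty = 20.0;  /* cycles added per miss (miss hits in L2) */

    /* Effective CPI = Ideal CPI + Miss Rate * Miss Penalty Cycles */
    double effective_cpi = ideal_cpi + miss_rate * miss_penalty;

    /* Equivalent per-100-instruction view used on the slide:
       95 hit instructions at 1 cycle + 5 miss instructions at (1 + 20) cycles */
    double cycles_per_100 = 95.0 * 1.0 + 5.0 * (1.0 + miss_penalty);

    printf("Effective CPI         = %.2f\n", effective_cpi);   /* 2.00 */
    printf("Cycles per 100 instrs = %.0f\n", cycles_per_100);  /* 200  */
    return 0;
}
```

Both views agree: a 5% miss rate with a 20-cycle penalty doubles the CPI of an otherwise ideal pipeline.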
9c.9 Multithreading
• By executing multiple threads we can keep the processor busy with ________________
• Swap to the next thread when the current thread hits a ________________________________
  [Diagram: Threads 1–4 each alternate compute time (C) and memory latency (M); while one thread waits on memory, another can compute.]

9c.10 Case for Multithreading
• Long latency events
  – Cache miss, exceptions, lock (synchronization), long instructions such as MUL/DIV
• Long latency events cause in-order (IO) and even OoO pipelines to be underutilized
• Idea: Share the processor among two executing threads, switching when one hits a ___________________
  – Only penalty is flushing the pipeline
  [Diagram: a single thread alternates compute and cache-miss stalls; with two threads, one thread's compute overlaps the other thread's cache miss.]
  Adapted from: OpenSparc T1 Micro-architecture Specification

9c.11 Non-Blocking Caches
• Cache can service hits while fetching one or more _________________
  – Example: Pentium Pro has a non-blocking cache capable of handling ______________ ___________

9c.12 Power
• Power consumption decomposed into:
  – Static: Power constantly being dissipated (grows with # of transistors)
  – Dynamic: Power consumed for switching a bit (1 to 0)
• P_DYN = I_DYN * V_DD ≈ ½ * C_TOT * V_DD² * f
  – Recall, I = C dV/dt
  – V_DD is the logic '1' voltage, f = clock frequency
• Dynamic power favors parallel processing vs. higher clock rates
  – V_DD value is tied to f, so a reduction/increase in f leads to a similar change in V_DD
  – Implies power is proportional to f³ (a cubic savings in power if we can reduce f)
  – Take a core and replicate it 4x => 4x performance and ___________
  – Take a core and increase clock rate 4x => 4x performance and ________
• Static power
  – Leakage occurs no matter what the frequency is
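The f³ relationship on slide 9c.12 can be checked numerically. A minimal sketch in C under the slide's own assumption that V_DD scales roughly linearly with f; the function name and all constants here are illustrative placeholders, not real chip values.

```c
#include <stdio.h>

/* P_DYN ≈ 1/2 * C_TOT * V_DD^2 * f (slide 9c.12). With V_DD assumed
   proportional to f, P_DYN scales as f^3. */
static double pdyn(double c_tot, double vdd, double f) {
    return 0.5 * c_tot * vdd * vdd * f;
}

int main(void) {
    double c   = 1.0;  /* placeholder total switched capacitance */
    double vdd = 1.0;  /* placeholder supply voltage at the base frequency */
    double f   = 1.0;  /* base clock frequency (normalized) */

    double base      = pdyn(c, vdd, f);
    /* 4 replicated cores at the base frequency: 4x capacitance, same V_DD and f */
    double four_core = pdyn(4.0 * c, vdd, f);
    /* 1 core at 4x frequency: V_DD assumed to scale up with f as well */
    double fast_core = pdyn(c, 4.0 * vdd, 4.0 * f);

    printf("4 cores @ f : %.0fx base power\n", four_core / base); /*  4x */
    printf("1 core @ 4f : %.0fx base power\n", fast_core / base); /* 64x */
    return 0;
}
```

Both options deliver roughly 4x performance, but under this model the replicated cores pay 4x the dynamic power while the 4x-clocked core pays 4³ = 64x, which is the slide's argument for parallelism over clock rate.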
9c.13 Temperature
• Temperature is related to power consumption
  – Locations on the chip that burn more power will usually run hotter
    • Locations where bits toggle (register file, etc.) often become quite hot, especially if toggling continues for a long period of time
  – Too much heat can destroy a chip
  – Can use sensors to dynamically sense temperature
• Techniques for controlling temperature
  – External measures: Remove and spread the heat
    • Heat sinks, fans, even liquid-cooled machines
  – Architectural measures
    • Throttle performance (run at slower frequencies / lower voltages)
    • Global clock gating (pause...turn off the clock)
    • None...results can be catastrophic
      http://www.tomshardware.com/2001/09/17/hot_spot/

9c.14 Wire Delay
• In modern circuits wire delay (time to transmit the signal) begins to _________________________ (time for a gate to switch)
• As wires get longer
  – Resistance goes up and capacitance goes up, causing longer time delays (time is proportional to R*C)
• Dynamically scheduled, OoO processors require ___________________ for buses, forwarding, etc.
• Simpler pipelines often lead to _________________ signal connections (wires)
• CMP is really the only viable choice

9c.15
IMPLEMENTING MULTITHREADING AND MULTICORE

9c.16 Software Multithreading
• Used since the 1960's to hide I/O latency
  – Multiple processes with different virtual address spaces and process control blocks
  – On an I/O operation, state is saved and another process is given to the CPU
  – When the I/O operation completes, the process is rescheduled
• On a context switch
  – Trap processor and flush pipeline
  – Save state in process control block (____________ __________________________________)
  – Restore state of another process
  – Start execution and fill pipeline
• Very high overhead!
• Context switch is also triggered by ___________ _________________
  [Diagram: CPU (Regs, PC) runs one process; the OS scheduler keeps saved state (Regs, PC, metadata) for T1 = Ready, T2 = Blocked, T3 = Ready.]
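To make slide 9c.16 concrete, the saved state can be sketched as a C struct. This is a simplified, hypothetical sketch, not any particular OS's process control block; the field names are illustrative only.

```c
#include <stdint.h>

/* Simplified process control block (slide 9c.16): just enough state to stop
   a process and later resume it. Real PCBs hold much more. */
enum proc_state { READY, RUNNING, BLOCKED };

struct pcb {
    uint64_t regs[32];        /* saved general-purpose registers */
    uint64_t pc;              /* saved program counter */
    uint64_t page_table_base; /* each process has its own virtual address space */
    enum proc_state state;    /* READY, RUNNING, or BLOCKED (e.g., waiting on I/O) */
};

/* Software context switch: save the current process's state, restore the
   next one's. The register save/restore itself would be assembly; this only
   shows the OS bookkeeping, which is why the overhead is so high. */
void context_switch(struct pcb *current, struct pcb *next) {
    /* ...trap, flush the pipeline, copy CPU registers into current->regs... */
    current->state = BLOCKED;  /* e.g., it just issued an I/O request */
    next->state = RUNNING;
    /* ...load next->regs and next->pc into the CPU, refill the pipeline... */
}
```

Hardware multithreading (next slide) avoids exactly this software save/restore by keeping multiple copies of the register state on chip.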
9c.17 Hardware Multithreading
• Run multiple threads in turn on the same core
• Requires additional hardware for fast context switching
  – Multiple register files
  – Multiple state registers (condition codes, interrupt vector, etc.)
  – Avoids saving context manually (via software)

9c.18 Typical CMP Organization
• Private L1's require maintaining ___________ via snooping
• Sharing L1 is not a good idea
• L2 is shared (1 copy of data) and thus does not require a coherency mechanism
• Shared bus would be a bottleneck; use a switched network (multiple simultaneous connections)
  [Diagram: Chip Multi-Processor with 4 P's, each with a private L1, connected by an interconnect (on-chip network) to banked/shared L2 and Main Memory.]

9c.19 Sparc T1 Niagara (2005)
• 8 cores, each executing 4 threads, called a thread group
  – Zero-cycle thread switching penalty (round-robin; a selection sketch follows after slide 9c.20)
  – 6-stage pipeline: Fetch, Thread Select, Decode, Exec., Mem., WB
• Each core has its own L1 cache
• Each thread has its own
  – Register file, instruction and store buffers
• Threads share...
  – L1 cache, TLB, and execution units
• 3 MB shared L2 cache, 4 banks, 12-way set-associative
  – Is it a problem that it's not a power of 2? ______
• Example of fine-grained multithreading

9c.20 Sun T1 "Niagara" Block Diagram
  [Figure: block diagram of the Sun T1 "Niagara" chip.]
  http://ogun.stanford.edu/~kunle/publications/niagra_micro.pdf
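The round-robin, zero-penalty thread switching described on slide 9c.19 can be sketched in software. A minimal simulation under my own assumptions: each cycle the core issues from the next ready thread in round-robin order, skipping threads stalled on long-latency events. The function names and the ready/stall encoding are mine, not the T1's.

```c
#include <stdio.h>
#include <stdbool.h>

#define NTHREADS 4  /* threads per T1 core (a "thread group") */

/* Round-robin thread select: starting after the last-issued thread, pick the
   next thread that is ready. Returns the chosen thread id, or -1 if every
   thread is stalled (the pipeline sits idle that cycle). */
int select_thread(const bool ready[NTHREADS], int last) {
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (last + i) % NTHREADS;
        if (ready[t])
            return t;
    }
    return -1;
}

int main(void) {
    bool ready[NTHREADS] = { true, false, true, true }; /* thread 1 stalled on a miss */
    int last = 0;
    for (int cycle = 0; cycle < 6; cycle++) {
        int t = select_thread(ready, last);
        printf("cycle %d: issue from thread %d\n", cycle, t);
        if (t >= 0) last = t;
    }
    return 0;
}
```

Because the alternate register files already sit in hardware, switching threads this way costs zero cycles, unlike the software context switch of slide 9c.16.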
9c.21 Sun T1 "Niagara" Pipeline
  [Figure: the six-stage T1 pipeline.]
  http://ogun.stanford.edu/~kunle/publications/niagra_micro.pdf

9c.22 T1 Pipeline
• Fetch stage
  – Thread select mux chooses PC
  – Access I-TLB and I-Cache
  – 2 instructions fetched per cycle
• Thread select stage
  – Choose instructions to issue from ready threads
  – Issues based on
    • Instruction type
    • Misses
    • Resource conflicts
    • Traps and interrupts

9c.23 T1 Pipeline (continued)
• Decode stage
  – Accesses register file
• Execute stage
  – Includes ALU, shifter, MUL and DIV units
  – Forwarding unit
• Memory stage
  – DTLB, data cache, and 4 store buffers (1 per thread)
• WB stage
  – Write to register file

9c.24 Pipeline Scheduling
• No pipeline flush on context switch (except on cache miss)
• Full forwarding/bypassing to younger instructions of the same thread
• In case of a load, wait _________ before an instruction from the same thread is issued
  – Solved _________________ issue
• Scheduler guarantees fairness between threads by prioritizing the least recently scheduled thread
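Slide 9c.24's fairness rule, prioritizing the least recently scheduled thread, can also be sketched. A hypothetical sketch, not the actual T1 select logic: it tracks the cycle each thread last issued and picks the ready thread with the oldest timestamp.

```c
#include <stdio.h>
#include <stdbool.h>

#define NTHREADS 4

/* Pick the ready thread whose last issue was longest ago (slide 9c.24:
   "prioritizing the least recently scheduled thread"). Returns -1 if no
   thread is ready. */
int select_lrs(const bool ready[NTHREADS], const int last_issued[NTHREADS]) {
    int pick = -1;
    for (int t = 0; t < NTHREADS; t++) {
        if (ready[t] && (pick < 0 || last_issued[t] < last_issued[pick]))
            pick = t;
    }
    return pick;
}

int main(void) {
    bool ready[NTHREADS]       = { true, true, false, true };
    int  last_issued[NTHREADS] = { 5, 2, 1, 4 }; /* cycle of each thread's last issue */

    /* Thread 2 is the least recently scheduled but is not ready (e.g., a
       cache miss), so the scheduler picks thread 1, the oldest ready thread. */
    printf("issue from thread %d\n", select_lrs(ready, last_issued));
    return 0;
}
```

Combined with the ready/stall filtering of the thread select stage (slide 9c.22), this keeps any single thread from monopolizing issue slots.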