17.1 Unit 17: Improving Performance (Caching and Pipelining)

17.2 Improving Performance
• We want to improve the performance of our computation
• Question: What are we referring to when we say "performance"?
  – Latency
  – Throughput
• We will primarily consider latency (the total time to finish the overall task) in this discussion

17.3 How Do We Measure Performance?
• Fundamental measurement: time
  – Absolute time from start to finish
• To compare two alternative systems (HW + SW) and their performance, start a timer when you begin a task and stop it when the task ends
  – Do this for both systems and compare the resulting times
• We call this the latency of the system, and it works great from the perspective of the overall task
  – If system A completes the task in 2 seconds and system B requires 3 seconds, then system A is clearly superior
• But when we dig deeper and realize that the single, overall task is likely made of many small tasks, we can consider more than just latency

17.4 Speed Depends on Viewpoint
• What's faster to get from point A to point B?
  – A 747 jumbo airliner
  – An F-22 supersonic fighter jet
• If only one person needs to get from point A to point B, then the F-22 is faster
  – This is known as latency [units of seconds]
  – Time from the start of an operation until it completes
• If many people need to get from point A to point B, the 747 looks much better
  – This is known as throughput [jobs/second]
• The overall execution time (latency) may best be improved by improving throughput and not the latency of individual tasks
17.5 Hardware Techniques
• We can add hardware or reorganize our hardware to improve throughput and latency of individual tasks in an effort to reduce the total latency (time) to finish the overall task
• We will look at two examples:
  – Caching: Improves latency
  – Pipelining: Improves throughput

17.6 CACHING AND PIPELINING: Improving Latency and Throughput

17.7 Caching
• Remember what registers are used for?
  – Quick access to copies of data
  – Only a few (32 or 64) so that we can access them really quickly
  – Controlled by the software (compiler/programmer)
• Cache (def.): "to store away in hiding or for future use"
• Primary idea
  – The first time you access or use something you expend the full amount of time to get it
  – However, store it someplace (i.e., in a cache) and you can get it more quickly the next time you need it
  – The next time you need something, check if it is in the cache first
  – If it is in the cache, you can get it quickly; else go get it, expending the full amount of time (but then place it in the cache)
• Examples:
  – __________________
  – __________________
  – __________________

17.8 Cache Overview
• Cache memory is a small-ish (Kbytes to a few Mbytes) "fast" memory usually built onto the processor chip
• Will hold copies of the latest data & instructions accessed by the processor
• Managed by the hardware
  – Invisible (transparent) to the software
[Figure: processor chip containing registers (s0..sf, PC), ALUs, and on-chip cache memory, connected over a bus to main memory (RAM) at addresses 0x400000, 0x400040, ...]
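The "check the cache first" idea from slide 17.7 can be sketched as a software analogy. This is only an illustration of the lookup/fill pattern, not how a hardware cache is built: `slow_fetch()` is a hypothetical stand-in for main memory, and the cache here holds just one entry.

```c
#include <stdbool.h>

/* Hypothetical slow backing store: stands in for main memory. */
static int slow_fetch(int addr) { return addr * 2; }

/* A one-entry "cache": the simplest possible version of the idea. */
static int  cached_addr  = -1;
static int  cached_value = 0;
static bool valid        = false;

static int lookup(int addr, bool *hit) {
    if (valid && cached_addr == addr) {   /* check the cache first    */
        *hit = true;
        return cached_value;              /* fast path: cache hit     */
    }
    *hit = false;
    cached_value = slow_fetch(addr);      /* slow path: go to memory  */
    cached_addr  = addr;                  /* ...then keep a copy so   */
    valid        = true;                  /* the next access is fast  */
    return cached_value;
}
```

The first `lookup()` of an address misses and pays the full cost; a repeat of the same address hits and skips `slow_fetch()` entirely, which is the latency saving the slide describes.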
17.9 Cache Operation (1)
• When the processor wants data or instructions it always checks in the cache first
• If it is there, it can access it quickly
• If not, get it from memory:
  1. Proc. requests data @ 0x400028
  2. Cache does not have the data and thus requests the data from memory
  3. Memory responds not only with the desired data but with surrounding data
  4. Cache forwards data @ 0x400028
• Memory will also supply nearby (surrounding) data since it is likely to be needed soon
• Why?
  – Things like arrays (data) & code (instructions) are commonly accessed sequentially

17.10 Cache Operation (2)
• When the processor asks for the data again, or for the next data value in the array (or the next instruction of the code), the cache will likely have it:
  1. Proc. requests the data @ 0x400028 again
  2. Cache has the desired data & forwards it
  3. Proc. requests data @ 0x400024
  4. Cache also has the nearby data
• Main point: Caching reduces the latency of memory accesses, which improves overall program performance
• Questions?

17.11 Memory Hierarchy & Caching
• Use several levels of faster and faster memory to hide the latency of larger levels
  – Registers (smallest, fastest, most expensive)
  – L1 Cache (~1 ns)      [unit of transfer with registers: 8 to 64 bits]
  – L2 Cache (~10 ns)
  – Main Memory (~100 ns) [unit of transfer with the caches: 8-64 bytes]
  – (larger, slower, less expensive as we move down the hierarchy)

17.12 Pipelining
• We'll now look at a hardware technique called pipelining to improve throughput
• The key idea is to overlap the processing of multiple "items" (either data or instructions)
17.13 Pipelining Example
• Suppose you are asked to build dedicated hardware to perform some operation on all 100 elements of some arrays
• Suppose the operation (A[i]+B[i])/4 takes 10 ns to perform

  for(i=0; i < 100; i++)
      C[i] = (A[i] + B[i]) / 4;

• How long would it take to process the entire arrays: 100 * 10 ns = 1000 ns
  – Can we improve?
[Figure: a counter i drives an address generator that streams A[i] and B[i] from memory through the (A[i]+B[i])/4 logic and writes the result to C[i]]

17.14 Example
• Pipelining refers to insertion of registers to split combinational logic into smaller stages that can be overlapped in time (i.e., create an assembly line)
• Split the 10 ns operation into Stage 1 (the add, 5 ns) and Stage 2 (the divide, 5 ns):
  – Clock Cycle 0: A[0] + B[0]
  – Clock Cycle 1: A[1] + B[1]  |  (A[0] + B[0]) / 4
  – Clock Cycle 2: A[2] + B[2]  |  (A[1] + B[1]) / 4
• Clock freq. = 1/5 ns = 200 MHz (the clock period is the longest path from register to register)
• Time for the 0th element to complete: 10 ns
• Time between each of the remaining 99 elements completing: 5 ns
• Total: 10 ns + 99 * 5 ns = 505 ns (vs. 1000 ns unpipelined)

17.15 Need for Registers
• Registers provide separation between combinational functions
  – Without registers, fast signals could "catch up" to data values in the next operation stage
  – Performing an operation yields signals with different paths and delays (e.g., signal i arrives after 5 ns while signal j arrives after 2 ns)
  – We don't want signals from two different data values mixing; therefore we must collect and synchronize the values from the previous operation in a register before passing them on to the next
• Questions/Issues?
  – Unbalanced stage delays
  – Delay/cost of registers (not free to split stages)
  – This limits how much we can split our logic

17.16 Pipelining Example
• By adding more pipelined stages we can improve throughput
  – Split into 4 stages of 2.5 ns each
• Have we affected the latency of processing individual elements? No (still 4 * 2.5 ns = 10 ns per element)
• Time for the 0th element to complete: 10 ns
• Time between each of the remaining 99 elements completing: 2.5 ns
• Total: 10 ns + 99 * 2.5 ns = 257.5 ns (vs. 1000 ns: roughly a 4x speedup)
17.17 Non-Pipelined Processors
• Currently we know our processors execute software 1 instruction at a time
• 3 steps/stages of work for each instruction are:
  – Fetch
  – Decode
  – Execute
• Without pipelining, instruction i+1 cannot start its F stage until instruction i has finished its E stage

17.18 Pipelined Processors
• By breaking our processor hardware for instruction execution into stages we can overlap these stages of work
  – While instruction i executes, instruction i+1 decodes and instruction i+2 is fetched
• Latency for a single instruction is the same
• Overall throughput, and thus total latency, are greatly improved

17.19 More and More Stages
• We can break the basic stages of work into substages (F1 F2 D1 D2 E1 E2) to get better performance
• In doing so our clock period goes down; frequency goes up
  – 3 stages at 10 ns each: clock freq. = 1/10 ns = 100 MHz
  – 6 substages at 5 ns each: clock freq. = 1/5 ns = 200 MHz
• All kinds of interesting issues come up though when we overlap instructions; these are discussed in future CENG courses

17.20 Summary
• By investing extra hardware we can improve the overall latency of computation
• Measures of performance:
  – Latency is start-to-finish time
  – Throughput is tasks completed per unit time (a measure of parallelism)
• Caching reduces latency by holding data we will use in the future in quickly accessible memory
• Pipelining improves throughput by overlapping processing of multiple items (i.e., an assembly line)