Designing for Performance (Raul Queiroz Feitosa)

  1. Designing for Performance Raul Queiroz Feitosa

  2. Objective
     “In this section … we examine the most common approach to assessing processor and computer system performance” (W. Stallings)

  3. Outline
     • Performance Assessment
     • Amdahl’s Law

  4. Performance Assessment
     Which one would you choose?

                  Intel Xeon Platinum 8280L   EPYC 7601
     Cache        38.5 MB                     64 MB
     Frequency    2.7 GHz                     2.2 GHz
     Cores        28                          32

  5. Performance Assessment
     What matters?
     • Cost
     • Size
     • Reliability
     • Security
     • Power consumption
     • Performance

  6. Performance Assessment
     Main CPU operations
     • Fetch and decode instructions
     • Load and store data
     • Logic and arithmetic operations
       • Fixed-point
       • Floating-point

  7. Performance Assessment
     Performance factors
     • Clock speed or clock rate (f): expressed in multiples of Hz.
     • Clock cycle or clock tick: one increment, or pulse, of the clock.
     • Clock time (τ): the time between consecutive pulses, so τ = 1/f.

  8. Performance Assessment
     Performance factors
     • Clock speed
       • Usually, multiple clock cycles are required per instruction.
       • The amount of work implied by one instruction varies considerably.
       • Pipelining allows several instructions to be in execution simultaneously.
       • So, clock speed is not the whole story!

  9. Performance Assessment
     Performance factors
     • Instruction execution rate
       • Expressed in millions of instructions per second (MIPS) or millions of floating-point operations per second (MFLOPS).
       • Heavily dependent on the instruction set, compiler design, processor implementation, and cache and memory hierarchy.

  10. Performance Assessment
      Performance factors
      • CPI: average number of cycles per instruction.
      • $I_i$: number of machine instructions of type i executed by a program.
      • $CPI_i$: number of cycles per instruction of type i.
      • $I_c$: total number of machine instructions executed by a program, over n instruction types:
        $I_c = \sum_{i=1}^{n} I_i$
        $CPI = \frac{\sum_{i=1}^{n} (CPI_i \times I_i)}{I_c}$

  11. Performance Assessment
      Performance factors
      • T: processor time needed to execute a program.
        $T = I_c \times CPI \times \tau$
        A refinement yields
        $T = I_c \times [p + (m \times k)] \times \tau$
        where
        p is the number of processor cycles needed to decode and execute the instruction,
        m is the number of memory references needed,
        k is the ratio between memory cycle time and processor cycle time.
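The two time formulas can be checked numerically. The following is a minimal sketch, assuming τ = 1/f; the function names and the sample workload (1,000,000 instructions, p = 1, m = 0.5, k = 2, a 1 GHz clock) are illustrative assumptions, not values from the slides.

```python
def processor_time(i_c, cpi, f_hz):
    """T = I_c * CPI * tau, with tau = 1 / f (f in Hz, T in seconds)."""
    return i_c * cpi / f_hz


def refined_processor_time(i_c, p, m, k, f_hz):
    """T = I_c * [p + (m * k)] * tau; the effective CPI is p + m * k."""
    return processor_time(i_c, p + m * k, f_hz)


if __name__ == "__main__":
    # Hypothetical workload, not taken from the slides.
    t = refined_processor_time(i_c=1_000_000, p=1, m=0.5, k=2, f_hz=1e9)
    print(f"T = {t * 1e3:.3f} ms")
```

With these assumed values the effective CPI is 1 + 0.5 × 2 = 2, so the refined form reduces to the simpler one with CPI = 2.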

  12. Performance Assessment
      Performance factors
      System attributes affecting the performance factors ($\tau$, $I_c$, p, m, k):
      • Instruction set architecture
      • Compiler technology
      • Processor implementation
      • Cache and memory hierarchy

  13. Performance Assessment
      Exercise 1
      A program involves the execution of 2 million instructions on a 400 MHz processor. The CPI and the proportion of four instruction types are given below. Compute the average CPI.

      Instruction type             CPI   Instruction mix
      Arithmetic and logic         1     60%
      Load/store with cache hit    2     18%
      Branch                       4     12%
      Load/store with cache miss   8     10%

      The average CPI is
      $CPI = 0.6 + (2 \times 0.18) + (4 \times 0.12) + (8 \times 0.1) = 2.24$
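The Exercise 1 result can be reproduced with a short script. This is a sketch under stated assumptions: the slide asks only for the average CPI, while the MIPS rate (taken here as f / (CPI × 10^6)) and the execution time (T = I_c × CPI / f) are extra steps added for illustration, and all names are hypothetical.

```python
# (CPI_i, fraction of executed instructions) from the Exercise 1 table.
MIX = {
    "arithmetic and logic": (1, 0.60),
    "load/store, cache hit": (2, 0.18),
    "branch": (4, 0.12),
    "load/store, cache miss": (8, 0.10),
}
I_C = 2_000_000   # instructions executed
F_HZ = 400e6      # 400 MHz clock


def average_cpi(mix):
    """Weighted average: sum of CPI_i times the fraction of type i."""
    return sum(cpi * share for cpi, share in mix.values())


def mips_rate(f_hz, cpi):
    """MIPS rate = f / (CPI * 10^6); an added step, not asked by the slide."""
    return f_hz / (cpi * 1e6)


def execution_time(i_c, cpi, f_hz):
    """T = I_c * CPI * tau, with tau = 1 / f (result in seconds)."""
    return i_c * cpi / f_hz


if __name__ == "__main__":
    cpi = average_cpi(MIX)
    print(f"CPI  = {cpi:.2f}")
    print(f"MIPS = {mips_rate(F_HZ, cpi):.1f}")
    print(f"T    = {execution_time(I_C, cpi, F_HZ) * 1e3:.1f} ms")
```

The first printed value matches the 2.24 worked out on the slide.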

  14. Performance Assessment
      Exercise 2
      Consider two hardware implementations, M1 and M2, of the same instruction set. There are three instruction classes: F, I and N. The M1 clock rate is 600 MHz. The clock cycle of M2 is 2 ns. The average CPIs for these three instruction classes are:

      Class   CPI of M1   CPI of M2   Comments
      F       5.0         4.0         floating-point
      I       2.0         3.8         integer
      N       2.4         2.0         non-arithmetic

      a) Compute the peak performance of M1 and M2 in MIPS.
      b) If 50% of the instructions executed in a given program belong to class N and the others are equally distributed between F and I, which is the faster machine and by what factor?

  15. Performance Assessment
      Exercise 2 (continued)
      c) A designer of M1 plans to change the design to improve performance. Assuming the instruction mix in (b), which of the options below would be most beneficial?
         1. Use an FPU twice as fast (CPI = 2.5 for class F).
         2. Add a second ALU to reduce the CPI for integer operations to 1.20.
         3. Use faster logic that allows a clock rate of 750 MHz while keeping the same CPI values.
      d) The CPIs given above include cache misses, which occur 5 times per 100 executed instructions. Each cache miss incurs a 10-cycle penalty. The fourth redesign option consists of using a larger instruction cache so as to reduce the miss ratio from 5% to 3%. Compare this alternative with the previous options.
      e) Characterize the application programs that execute faster on M1 than on M2, i.e., discuss the instruction composition of such applications.
         Hint: let x, y and 1-x-y be the fractions of instructions belonging to classes F, I and N, respectively.

  16. Performance Assessment
      Exercise 3
      Consider two codes produced by two compilers for the same source program. The instructions of the machine that will execute these codes can be divided into class A (CPI = 1) and class B (CPI = 2). The number of executed instructions of each class is given below.

      Class   Compiler 1   Compiler 2   Comments
      A       600M         400M         CPI = 1
      B       400M         400M         CPI = 2

      a) Compute the execution time for both codes assuming a clock rate of 1 GHz.
      b) Which compiler produces the more efficient code and by what factor?
      c) Which code executes at the higher MIPS rate?

  17. Performance Assessment
      Benchmarks: motivation
      A high-level language statement:
          A = B + C   /* assume all quantities in main memory */
      Compiled code on RISC:
          load  mem(B),reg(1);
          load  mem(C),reg(2);
          add   reg(1),reg(2),reg(3);
          store reg(3),mem(A);
      Compiled code on CISC:
          add   mem(B),mem(C),mem(A)
      Both machines execute the same high-level code in the same time. So, if MIPS_CISC = 1, then MIPS_RISC = 4.
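The MIPS comparison on this slide can be reproduced in a few lines. This is a minimal sketch: the instruction counts (four for RISC, one for CISC) are read off the slide's listings, while the function name and the 1 microsecond execution time are assumptions made only so that the CISC machine comes out at 1 MIPS.

```python
def mips(instruction_count, time_us):
    """Instructions per microsecond, which is numerically the MIPS rate."""
    return instruction_count / time_us


# Instruction counts for A = B + C, read off the two listings.
RISC_INSTRUCTIONS = 4   # load, load, add, store
CISC_INSTRUCTIONS = 1   # one memory-to-memory add

if __name__ == "__main__":
    time_us = 1.0  # assumed: both machines take 1 microsecond for the statement
    print("CISC MIPS:", mips(CISC_INSTRUCTIONS, time_us))
    print("RISC MIPS:", mips(RISC_INSTRUCTIONS, time_us))
```

Both machines finish the same statement in the same time, yet the RISC machine reports four times the MIPS rate, which is why instruction rates alone are a poor basis for comparing different architectures.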

  18. Performance Assessment
      Benchmarks: definition
      • Programs designed to test performance
      • Written in a high-level language → portable
      • Representative of a particular application or system programming area (systems, numerical, commercial)
      • Easily measured and widely distributed
      • The best-known collection of benchmark suites is maintained by the Standard Performance Evaluation Corporation (SPEC).
      • The best known of the SPEC suites is CPU2017, which:
        • contains 43 benchmarks organized into four suites
        • includes an optional metric for measuring energy consumption

  19. Performance Assessment
      SPECspeed metric
      • SPEC benchmarks are not concerned with instruction execution rates.
      • A base runtime is defined for each benchmark using a reference machine.
      • The speed metric is the ratio of reference time to system run time: $r_i = Tref_i / Tsut_i$
      • $Tref_i$: execution time of benchmark i on the reference machine
      • $Tsut_i$: execution time of benchmark i on the system under test

  20. Performance Assessment
      SPECrate metric
      • Measures the throughput, or rate, of a machine carrying out a number of tasks.
      • Multiple copies of the benchmarks run simultaneously.
        • Typically, the number of copies equals the number of processors.
      • The ratio is calculated as follows: $r_i = \frac{N \times Tref_i}{Tsut_i}$
      • $Tref_i$: reference execution time for benchmark i
      • N: number of copies running simultaneously
      • $Tsut_i$: elapsed time from the start of execution of all N copies until completion of all copies of the program
      • Again, a geometric mean is calculated.
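The two per-benchmark ratios can be sketched as follows. This assumes the usual textbook forms, Tref_i / Tsut_i for SPECspeed and (N × Tref_i) / Tsut_i for SPECrate; the function names and the sample timings are hypothetical and do not come from any published SPEC result.

```python
def specspeed_ratio(t_ref, t_sut):
    """Speed ratio for one benchmark: Tref_i / Tsut_i."""
    return t_ref / t_sut


def specrate_ratio(t_ref, t_sut, n_copies):
    """Rate ratio for one benchmark: (N * Tref_i) / Tsut_i."""
    return n_copies * t_ref / t_sut


if __name__ == "__main__":
    # Hypothetical timings in seconds.
    print(specspeed_ratio(t_ref=1000.0, t_sut=250.0))
    print(specrate_ratio(t_ref=1000.0, t_sut=500.0, n_copies=4))
```

In this made-up case, running four copies takes twice as long as running one, but the rate ratio still doubles because four times as much work completes.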

  21. Performance Assessment
      Averaging SPEC metrics
      • For both SPECspeed and SPECrate, the selected ratios are averaged using the geometric mean, which is reported as the overall metric.
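The geometric mean of n ratios is the n-th root of their product. The sketch below computes it through logarithms to avoid overflow on long lists; the function name and the sample ratios are illustrative assumptions.

```python
import math


def geometric_mean(ratios):
    """n-th root of the product of n positive ratios, computed via logs."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("need at least one ratio, and all must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # Hypothetical per-benchmark ratios.
    print(f"{geometric_mean([2.0, 8.0]):.2f}")
```

For the two assumed ratios the arithmetic mean would be 5, while the geometric mean is 4. Python 3.8 and later also provide `statistics.geometric_mean` in the standard library.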

  22. Performance Assessment
      Exercise 4
      The table below shows the execution times, in seconds, of two benchmarks on three different processors.

      Benchmark   Processor X   Processor Y   Processor Z
      1           20            10            40
      2           40            80            20

      a) Compute the arithmetic mean value for each system using X as the reference machine and then using Y as the reference machine.
      b) Compute the geometric mean value for each system using X as the reference machine and then using Y as the reference machine.
      Which is the most realistic result?

  23. Outline
      • Performance Assessment
      • Amdahl’s Law

  24. Amdahl’s Law
      Estimates the potential speedup of a program using multiple processors.
      • A fraction f of the code is parallelizable with no scheduling overhead.
      • A fraction (1 - f) of the code is inherently serial.
      • T is the total execution time of the program on a single processor.
      • N is the number of processors that fully exploit the parallel portions of the code.
      $\text{Speedup} = \frac{T}{(1-f)\,T + \frac{f\,T}{N}} = \frac{1}{(1-f) + \frac{f}{N}}$
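With the quantities defined above, Amdahl's law gives a speedup of 1 / ((1 - f) + f / N). The following is a minimal sketch of that formula; the function name and the choice of f = 0.9 are assumptions made for illustration.

```python
def amdahl_speedup(f, n):
    """Speedup = 1 / ((1 - f) + f / N) for parallel fraction f on N processors."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if n < 1:
        raise ValueError("N must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


if __name__ == "__main__":
    # Hypothetical program with 90% parallelizable code.
    for n in (1, 2, 4, 8, 16):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

Doubling N from 8 to 16 in this assumed case adds far less speedup than doubling it from 1 to 2, because the serial 10% starts to dominate.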

  25. Amdahl’s Law
      Conclusions
      • Code needs to be parallelizable and actually parallelized!
      • When f is small, adding parallel processors has little effect.
      • As N → ∞, the speedup is bounded by 1/(1 - f).
      • Because the speedup is bounded, adding more processors gives diminishing returns.
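The bound follows directly from the speedup formula, since the term f/N vanishes as N grows. The worked values in the comments are illustrative assumptions, not figures from the slides.

```latex
\[
\text{Speedup}(N) = \frac{1}{(1-f) + \frac{f}{N}}
\quad\Longrightarrow\quad
\lim_{N \to \infty} \text{Speedup}(N) = \frac{1}{1-f}
\]
% Illustrative values (assumed):
% f = 0.5 gives a bound of 2; f = 0.9 gives a bound of 10.
```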
