Designing for Performance Raul Queiroz Feitosa
Objective
“In this section … we examine the most common approach to assessing processor and computer system performance” (W. Stallings)
Outline
Performance Assessment
Amdahl’s Law
Performance Assessment: Which one would you choose?
             Intel Xeon Platinum 8280L   EPYC 7601
Cache        38.5 MB                     64 MB
Frequency    2.7 GHz                     2.2 GHz
Cores        28                          32
Performance Assessment: What matters?
Cost
Size
Reliability
Security
Power consumption
Performance
Performance Assessment: Main CPU operations
Fetch and decode instructions
Load and store data
Logic and arithmetic operations
    Fixed-point
    Floating-point
Performance Assessment: Performance factors
Clock speed or clock rate ($f$): expressed in multiples of Hz.
Clock cycle or clock tick: one increment, or pulse, of the clock.
Clock time ($\tau$): time between consecutive pulses, $\tau = 1/f$.
Performance Assessment: Performance factors: Clock speed
Usually, multiple clock cycles are required per instruction.
The amount of work implied by one instruction varies considerably.
Pipelining allows several instructions to be in execution simultaneously.
So, clock speed is not the whole story!
Performance Assessment: Performance factors: Instruction execution rate
Expressed in millions of instructions per second (MIPS) or millions of floating-point operations per second (MFLOPS).
Heavily dependent on instruction set, compiler design, processor implementation, and cache and memory hierarchy.
Performance Assessment: Performance factors
CPI: average number of cycles per instruction.
$I_i$: number of machine instructions of type $i$ executed by a program.
$CPI_i$: number of cycles per instruction of type $i$.
$I_c$: number of machine instructions executed by a program.
$$I_c = \sum_{i=1}^{n} I_i$$
$$CPI = \frac{\sum_{i=1}^{n} (CPI_i \times I_i)}{I_c}$$
Performance Assessment: Performance factors
$T$: processor time needed to execute a program.
$$T = I_c \times CPI \times \tau$$
A refinement yields
$$T = I_c \times [p + (m \times k)] \times \tau$$
where
$p$ is the number of processor cycles needed to decode and execute the instruction,
$m$ is the number of memory references needed,
$k$ is the ratio between memory cycle time and processor cycle time.
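The two forms of the processor-time equation can be compared numerically. The following is a minimal Python sketch; the function names and all sample values (instruction count, p, m, k, clock rate) are illustrative assumptions and do not come from the slides.

```python
def processor_time(instruction_count, cpi, clock_rate_hz):
    """Basic form: T = I_c * CPI * tau, with tau = 1 / f."""
    tau = 1.0 / clock_rate_hz
    return instruction_count * cpi * tau


def refined_processor_time(instruction_count, p, m, k, clock_rate_hz):
    """Refined form: T = I_c * (p + m * k) * tau."""
    tau = 1.0 / clock_rate_hz
    return instruction_count * (p + m * k) * tau


# Hypothetical values, chosen only for illustration.
I_C = 1_000_000   # instructions executed
P = 2             # processor cycles to decode and execute an instruction
M = 0.5           # average memory references per instruction
K = 4             # memory cycle time / processor cycle time
F = 1e9           # clock rate: 1 GHz

t_refined = refined_processor_time(I_C, P, M, K, F)

# With these values p + m * k = 4, so the refined form matches the
# basic form evaluated with CPI = 4.
t_basic = processor_time(I_C, 4, F)

print(f"refined: {t_refined * 1e3:.3f} ms, basic: {t_basic * 1e3:.3f} ms")
```

The refined form simply splits the CPI into a processor part (p) and a memory part (m × k), which makes it easier to see which system attribute a design change acts on.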
Performance Assessment: Performance factors
System attributes affecting the performance factors ($\tau$, $I_c$, $p$, $m$, $k$):
Instruction set architecture
Compiler technology
Processor implementation
Cache and memory hierarchy
Performance Assessment: Exercise 1
A program involves the execution of 2 million instructions on a 400 MHz processor. The CPI and the proportion of the four instruction types are given below. Compute the average CPI.
Instruction type              CPI   Instruction mix
Arithmetic and logic          1     60%
Load/store with cache hit     2     18%
Branch                        4     12%
Load/store with cache miss    8     10%
The average CPI is
$$CPI = 0.6 + (2 \times 0.18) + (4 \times 0.12) + (8 \times 0.1) = 2.24$$
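The arithmetic of Exercise 1 can be checked with a few lines of Python. This is a sketch only: the variable names are assumptions, and the MIPS rate and execution time go one step beyond what the exercise asks, using the standard relations MIPS = f / (CPI × 10^6) and T = I_c × CPI / f.

```python
# (CPI, fraction of the instruction mix) per instruction type, from Exercise 1.
INSTRUCTION_MIX = {
    "arithmetic and logic": (1, 0.60),
    "load/store with cache hit": (2, 0.18),
    "branch": (4, 0.12),
    "load/store with cache miss": (8, 0.10),
}
INSTRUCTION_COUNT = 2_000_000   # I_c
CLOCK_RATE_HZ = 400e6           # f = 400 MHz

# Weighted average of the per-type CPI values.
average_cpi = sum(cpi * fraction for cpi, fraction in INSTRUCTION_MIX.values())

# Derived quantities (not asked for in the exercise).
mips_rate = CLOCK_RATE_HZ / (average_cpi * 1e6)
execution_time_s = INSTRUCTION_COUNT * average_cpi / CLOCK_RATE_HZ

print(f"average CPI    = {average_cpi:.2f}")
print(f"MIPS rate      = {mips_rate:.1f}")
print(f"execution time = {execution_time_s * 1e3:.1f} ms")
```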
Performance Assessment: Exercise 2
Consider two hardware implementations, M1 and M2, of the same instruction set. There are three instruction classes: F, I and N. The M1 clock rate is 600 MHz. The clock cycle of M2 is 2 ns. The average CPI for each of the three instruction classes is:
Class   CPI of M1   CPI of M2   Comments
F       5.0         4.0         floating-point
I       2.0         3.8         integer
N       2.4         2.0         non-arithmetic
a) Compute the peak performance of M1 and M2 in MIPS.
b) If 50% of the instructions executed in a given program belong to class N and the others are equally distributed between F and I, which is the faster machine, and by what factor?
Performance Assessment: Exercise 2 (continued)
c) A designer of M1 plans to change the design to improve performance. Assuming the instruction mix in (b), which of the options below would be the most beneficial?
    1. Use an FPU twice as fast (CPI = 2.5 for class F).
    2. Add a second ALU to reduce the CPI for integer operations to 1.20.
    3. Use faster logic that allows a clock rate of 750 MHz while keeping the same CPI values.
d) The CPI values given above include a cache miss that occurs 5 times per 100 executed instructions. Each cache miss implies a 10-cycle penalty. The fourth redesign option consists of using a larger instruction cache so as to reduce the miss ratio from 5% to 3%. Compare this alternative with the previous options.
e) Characterize application programs that execute faster on M1 than on M2, i.e., discuss the instruction composition of such applications.
Hint: let x, y and 1-x-y be the fractions of instructions belonging to classes F, I and N, respectively.
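One possible way to work through parts (a) to (d) of Exercise 2 is sketched below in Python. It rests on several assumptions that the slides do not state: "peak performance" is read as the MIPS rate when only the lowest-CPI class is executed, the redesign options are compared by time per instruction under the mix of part (b), and the cache option in (d) is modelled as removing 0.02 × 10 cycles from the average CPI. All names are hypothetical.

```python
CLOCK_HZ = {"M1": 600e6, "M2": 500e6}   # M2: 1 / (2 ns) = 500 MHz
CPI = {
    "M1": {"F": 5.0, "I": 2.0, "N": 2.4},
    "M2": {"F": 4.0, "I": 3.8, "N": 2.0},
}
MIX = {"F": 0.25, "I": 0.25, "N": 0.50}  # instruction mix of part (b)


def average_cpi(cpi_by_class, mix):
    """Weighted average CPI for a given instruction mix."""
    return sum(cpi_by_class[c] * mix[c] for c in mix)


def mips(clock_hz, cpi):
    """MIPS rate = f / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


# (a) Peak MIPS, assuming only the lowest-CPI class is executed.
peak_mips = {m: mips(CLOCK_HZ[m], min(CPI[m].values())) for m in CPI}

# (b) Average CPI and time per instruction for the given mix.
avg_cpi = {m: average_cpi(CPI[m], MIX) for m in CPI}
time_per_instr = {m: avg_cpi[m] / CLOCK_HZ[m] for m in CPI}
m1_speedup_over_m2 = time_per_instr["M2"] / time_per_instr["M1"]

# (c), (d) Time per instruction of each M1 redesign option.
option_time = {
    "1: faster FPU": average_cpi({**CPI["M1"], "F": 2.5}, MIX) / CLOCK_HZ["M1"],
    "2: second ALU": average_cpi({**CPI["M1"], "I": 1.2}, MIX) / CLOCK_HZ["M1"],
    "3: 750 MHz clock": avg_cpi["M1"] / 750e6,
    # Assumed model: miss ratio 5% -> 3% at 10 cycles per miss.
    "4: larger cache": (avg_cpi["M1"] - (0.05 - 0.03) * 10) / CLOCK_HZ["M1"],
}
speedup = {name: time_per_instr["M1"] / t for name, t in option_time.items()}

print("peak MIPS:", peak_mips)
print("average CPI:", avg_cpi)
print(f"M1 vs M2 speedup: {m1_speedup_over_m2:.2f}")
for name, s in sorted(speedup.items(), key=lambda kv: -kv[1]):
    print(f"option {name}: speedup {s:.3f}")
```

Under these assumptions both machines have the same average CPI for the mix in (b), so the difference between them comes from the clock rate alone. Part (e) is left open; the hint suggests expressing both execution times as functions of x and y and comparing them.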
Performance Assessment: Exercise 3
Consider two codes produced by two compilers for the same source program. The instructions of the machine that will execute these codes can be divided into class A (CPI = 1) and class B (CPI = 2). The number of executed instructions of each class is given below.
Class   Compiler 1   Compiler 2   Comments
A       600M         400M         CPI = 1
B       400M         400M         CPI = 2
a) Compute the execution time for both codes, assuming a clock rate of 1 GHz.
b) Which compiler produces the most efficient code, and by what factor?
c) Which code executes at the highest MIPS?
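Exercise 3 can be checked numerically with the sketch below. The helper name `summarize` and the dictionary layout are assumptions made for illustration; the instruction counts, CPI values and clock rate are those of the exercise.

```python
CLOCK_RATE_HZ = 1e9                 # 1 GHz
CLASS_CPI = {"A": 1, "B": 2}
INSTRUCTIONS = {
    "compiler 1": {"A": 600e6, "B": 400e6},
    "compiler 2": {"A": 400e6, "B": 400e6},
}


def summarize(counts):
    """Return (execution time in seconds, MIPS rate) for one compiled code."""
    total_instructions = sum(counts.values())
    total_cycles = sum(counts[c] * CLASS_CPI[c] for c in counts)
    time_s = total_cycles / CLOCK_RATE_HZ
    mips_rate = total_instructions / (time_s * 1e6)
    return time_s, mips_rate


results = {name: summarize(counts) for name, counts in INSTRUCTIONS.items()}

for name, (time_s, mips_rate) in results.items():
    print(f"{name}: {time_s:.2f} s, {mips_rate:.1f} MIPS")
```

With these numbers the code with the shorter execution time is not the one with the higher MIPS rate, which is one reason MIPS alone is a poor basis for comparison.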
Performance Assessment: Benchmarks: motivation
A high-level language statement:
    A = B + C   /* assume all quantities in main memory */
Compiled code on CISC:
    add mem(B),mem(C),mem(A)
Compiled code on RISC:
    load mem(B),reg(1);
    load mem(C),reg(2);
    add reg(1),reg(2),reg(3);
    store reg(3),mem(A);
Suppose both machines execute the same high-level code in the same time. The RISC machine executes four instructions where the CISC machine executes one, so if $MIPS_{CISC} = 1$, then $MIPS_{RISC} = 4$, even though both do the same work.
Performance Assessment: Benchmarks: definition
Programs designed to test performance.
Written in a high-level language, hence portable.
Represent a particular application or system programming area (systems, numerical, commercial).
Easily measured and widely distributed.
The best-known collection of benchmark suites is defined and maintained by the Standard Performance Evaluation Corporation (SPEC).
The best known of the SPEC suites is CPU2017, which
    contains 43 benchmarks organized into four suites;
    includes an optional metric for measuring energy consumption.
Performance Assessment: SPECspeed metric
SPEC benchmarks are not concerned with instruction execution rates.
A base runtime is defined for each benchmark using a reference machine.
The speed metric is the ratio of the reference time to the system run time:
$$r_i = \frac{Tref_i}{Tsut_i}$$
$Tref_i$: execution time of benchmark $i$ on the reference machine.
$Tsut_i$: execution time of benchmark $i$ on the system under test.
Performance Assessment: SPECrate metric
Measures the throughput, or rate, of a machine carrying out a number of tasks.
Multiple copies of the benchmarks run simultaneously.
Typically, the number of copies equals the number of processors.
The ratio is calculated as follows:
$$r_i = \frac{N \times Tref_i}{Tsut_i}$$
$Tref_i$: reference execution time for benchmark $i$.
$N$: number of copies running simultaneously.
$Tsut_i$: elapsed time from the start of execution of all $N$ copies until the completion of all copies of the program.
Again, a geometric mean is calculated.
Performance Assessment: Averaging SPEC metrics
For both SPECspeed and SPECrate, the selected ratios are averaged using the geometric mean, which is reported as the overall metric:
$$r_G = \left( \prod_{i=1}^{n} r_i \right)^{1/n}$$
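The SPECspeed ratio (reference time over measured time), the SPECrate ratio (the same, scaled by the number of copies N) and their geometric mean can be sketched in Python as follows. The function names and every timing value are hypothetical; they are not SPEC data and are chosen only so the results are easy to verify by hand.

```python
import math


def speed_ratio(t_ref, t_sut):
    """SPECspeed-style ratio: reference time over measured time."""
    return t_ref / t_sut


def rate_ratio(t_ref, t_sut, copies):
    """SPECrate-style ratio: N * reference time over elapsed time."""
    return copies * t_ref / t_sut


def geometric_mean(ratios):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical (reference time, measured time) pairs, in seconds.
SPEED_RUNS = [(1000.0, 250.0), (2000.0, 1000.0), (1500.0, 1500.0)]
RATE_RUNS = [(1000.0, 500.0), (2000.0, 2000.0), (1500.0, 3000.0)]
COPIES = 4   # hypothetical number of simultaneous copies

speed_ratios = [speed_ratio(ref, sut) for ref, sut in SPEED_RUNS]
rate_ratios = [rate_ratio(ref, sut, COPIES) for ref, sut in RATE_RUNS]

print("speed ratios:", speed_ratios, "overall:", geometric_mean(speed_ratios))
print("rate ratios: ", rate_ratios, "overall:", geometric_mean(rate_ratios))
```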
Performance Assessment: Exercise 4
The table below shows the execution times, in seconds, of two benchmarks on three different processors.
Benchmark   X    Y    Z
1           20   10   40
2           40   80   20
a) Compute the arithmetic mean value for each system using X as the reference machine and then using Y as the reference machine.
b) Compute the geometric mean value for each system using X as the reference machine and then using Y as the reference machine.
Which is the most realistic result?
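The comparison asked for in Exercise 4 can be explored with the Python sketch below. It assumes that each time is normalized as reference time divided by measured time (so larger is better), which is one reasonable reading of the exercise; the function and variable names are illustrative.

```python
import math

# Execution times in seconds: one entry per benchmark, per processor.
TIMES = {"X": [20, 40], "Y": [10, 80], "Z": [40, 20]}


def normalized(reference):
    """Ratios t_ref / t_sys for every system, per benchmark (assumed form)."""
    return {
        system: [ref / t for ref, t in zip(TIMES[reference], times)]
        for system, times in TIMES.items()
    }


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))


# summary[reference][system] = (arithmetic mean, geometric mean)
summary = {
    reference: {
        system: (arithmetic_mean(ratios), geometric_mean(ratios))
        for system, ratios in normalized(reference).items()
    }
    for reference in ("X", "Y")
}

for reference, per_system in summary.items():
    for system, (am, gm) in per_system.items():
        print(f"ref {reference}, system {system}: AM = {am:.3f}, GM = {gm:.3f}")
```

Running it shows whether the ranking of the systems changes with the choice of reference machine for each kind of mean.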
Amdahl’s Law
Estimates the potential speedup of a program using multiple processors.
A fraction $f$ of the code is parallelizable with no scheduling overhead.
A fraction $(1 - f)$ of the code is inherently serial.
$T$ is the total execution time of the program on a single processor.
$N$ is the number of processors that fully exploit the parallel portions of the code.
$$Speedup = \frac{T}{T(1 - f) + \frac{T f}{N}} = \frac{1}{(1 - f) + \frac{f}{N}}$$
Amdahl’s Law: Conclusions
Code needs to be parallelizable (and parallelized)!
When $f$ is small, adding parallel processors has little effect.
As $N \to \infty$, the speedup is bounded by $1/(1 - f)$.
Because the speedup is bounded, adding more processors gives diminishing returns.
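The diminishing returns can be seen numerically with a short Python sketch of Amdahl's law, speedup = 1 / ((1 - f) + f / N). The function names and the sample value f = 0.9 are assumptions chosen for illustration.

```python
def amdahl_speedup(f, n):
    """Speedup with n processors when a fraction f of the code is parallel."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


def speedup_limit(f):
    """Upper bound on the speedup as n grows without limit: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


PARALLEL_FRACTION = 0.9   # hypothetical value of f

for processors in (1, 2, 8, 64, 1024):
    s = amdahl_speedup(PARALLEL_FRACTION, processors)
    print(f"N = {processors:5d}: speedup = {s:.2f}")

print(f"limit for f = {PARALLEL_FRACTION}: {speedup_limit(PARALLEL_FRACTION):.1f}")
```

Even with 90% of the code parallelizable, the speedup in this sketch never exceeds 10, however many processors are added.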