Lecture 2: Architectural Performance Laws and Rules of Thumb

  1. Lecture 2: Architectural Performance Laws and Rules of Thumb Prof. V. Catania Lab. Calcolatori

  2. Measurement Tools
     • Benchmarks, Traces, Mixes
     • Cost, delay, area, power estimation
     • Simulation (many levels)
       – ISA, RT, Gate, Circuit
     • Queuing Theory
     • Rules of Thumb
     • Fundamental Laws

  3. The Bottom Line: Performance (and Cost)
     "X is n times faster than Y" means
     $$ n = \frac{ExTime(Y)}{ExTime(X)} = \frac{Performance(X)}{Performance(Y)} $$
     • Speed of Concorde vs. Boeing 747
     • Throughput of Boeing 747 vs. Concorde

  4. Performance Terminology
     “X is n% faster than Y” means:
     $$ \frac{ExTime(Y)}{ExTime(X)} = \frac{Performance(X)}{Performance(Y)} = 1 + \frac{n}{100} $$
     $$ n = \frac{100\,(Performance(X) - Performance(Y))}{Performance(Y)} $$
     Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

  5. Example
     $$ \frac{ExTime(Y)}{ExTime(X)} = \frac{15}{10} = \frac{1.5}{1.0} = \frac{Performance(X)}{Performance(Y)} $$
     $$ n = \frac{100\,(1.5 - 1.0)}{1.0} = 50\% $$
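
The arithmetic on this slide can be checked with a short sketch. The function name `percent_faster` and its parameter names are illustrative, not from the lecture.

```python
def percent_faster(ex_time_x: float, ex_time_y: float) -> float:
    """Return n such that X is n% faster than Y, given execution times."""
    # Performance is the reciprocal of execution time, so
    # Performance(X) / Performance(Y) = ExTime(Y) / ExTime(X) = 1 + n/100.
    ratio = ex_time_y / ex_time_x
    return 100.0 * (ratio - 1.0)


# Slide example: Y takes 15 seconds, X takes 10 seconds.
print(percent_faster(ex_time_x=10.0, ex_time_y=15.0))  # 50.0
```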

  6. Amdahl's Law
     MAKE THE COMMON CASE FAST!
     The performance improvement that can be gained by making some activity faster is limited by the fraction of time that activity takes up.
     SPEEDUP: a measure of how much faster a task runs on the ENHANCED machine

  7. Amdahl's Law
     Speedup due to enhancement E:
     $$ Speedup(E) = \frac{ExTime \text{ w/o } E}{ExTime \text{ w/ } E} = \frac{Performance \text{ w/ } E}{Performance \text{ w/o } E} $$
     Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected, then:
     ExTime(E) = ?
     Speedup(E) = ?

  8. Amdahl’s Law
     $$ ExTime_{new} = ExTime_{old} \times \left[ (1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}} \right] $$
     $$ Speedup_{overall} = \frac{ExTime_{old}}{ExTime_{new}} = \frac{1}{(1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}}} $$

  9. Amdahl’s Law
     • Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
     ExTime new = ?
     Speedup overall = ?

  10. Amdahl’s Law
      • Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
      $$ ExTime_{new} = ExTime_{old} \times \left( 0.9 + \frac{0.1}{2} \right) = 0.95 \times ExTime_{old} $$
      $$ Speedup_{overall} = \frac{1}{0.95} = 1.053 $$
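
A minimal sketch of Amdahl's Law as a function, applied to the floating-point example. The name `amdahl_speedup` is illustrative, and the sketch assumes the 10% figure is a fraction of execution time, which is what the formula requires.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of execution time is accelerated."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # New time relative to old: untouched part plus accelerated part.
    relative_new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / relative_new_time


# Slide example: FP runs 2x faster, FP is 10% of the time (assumed).
print(f"{amdahl_speedup(0.10, 2.0):.3f}")  # 1.053
```

Even an infinitely fast FP unit could not push the overall speedup past 1/0.9 here, since 90% of the time is untouched.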

  11. Amdahl's Law
      CPU speed improved by 5x
      CPU cost increased by 5x
      CPU in use 50% of the time (the other 50% is I/O)
      CPU cost = 1/3 of total computer cost
      Evaluate the investment from a cost/performance viewpoint:
      $$ Speedup = \frac{1}{0.5 + \frac{0.5}{5}} = 1.67 $$
      $$ New\ cost = \frac{2}{3} \times 1 + \frac{1}{3} \times 5 = 2.33 \text{ times the original cost} $$
      Cost increase > performance improvement!
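
The cost/performance comparison can be sketched as below. The helper names `overall_speedup` and `relative_cost` are hypothetical. The cost model simply scales the CPU's share of the total cost and leaves the rest unchanged, as the slide does.

```python
def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl's Law: speedup of the whole task."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def relative_cost(component_share: float, cost_factor: float) -> float:
    """New system cost relative to the old one when one component's cost scales."""
    return (1.0 - component_share) + component_share * cost_factor


speedup = overall_speedup(0.5, 5.0)    # CPU busy 50% of the time, 5x faster
cost = relative_cost(1.0 / 3.0, 5.0)   # CPU is 1/3 of the cost, 5x more expensive
worthwhile = speedup > cost

print(f"speedup = {speedup:.2f}, cost = {cost:.2f}, worthwhile = {worthwhile}")
```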

  12. Amdahl's Law
      FPSQR ops are responsible for 20% of execution time
      FP ops are responsible for 50% of execution time
      Alternative enhancements:
      1. Implement FPSQR ops in hardware, with a speedup of 10
      2. Make ALL FP ops run 2x faster, at the same cost as alternative 1
      Comparison:
      $$ Speedup_{FPSQR} = \frac{1}{(1 - 0.2) + \frac{0.2}{10}} = 1.22 $$
      $$ Speedup_{FP} = \frac{1}{(1 - 0.5) + \frac{0.5}{2}} = 1.33 $$
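
The two alternatives can be compared side by side with a small sketch. The dictionary keys are illustrative labels, and each entry holds the (fraction of time, local speedup) pair taken from the slide.

```python
# (fraction of execution time affected, speedup of that fraction)
alternatives = {
    "FPSQR in hardware": (0.20, 10.0),
    "all FP 2x faster": (0.50, 2.0),
}

# Amdahl's Law applied to each alternative.
speedups = {
    name: 1.0 / ((1.0 - fraction) + fraction / factor)
    for name, (fraction, factor) in alternatives.items()
}
best = max(speedups, key=speedups.get)

for name, value in speedups.items():
    print(f"{name}: {value:.2f}")
print("better choice:", best)
```

The smaller local speedup wins because it applies to a larger share of the execution time.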

  13. Corollary: Make The Common Case Fast
      • All instructions require an instruction fetch, only a fraction require a data fetch/store.
        – Optimize instruction access over data access
      • Programs exhibit locality
        – Spatial Locality
        – Temporal Locality
      • Access to small memories is faster
        – Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories.
      [Figure: storage hierarchy: Reg's, Cache, Memory, Disk / Tape]

  14. Amdahl's Law
      • Cache memory is 5x FASTER than main memory
      • 90% of CPU time is spent in a fraction of the code that could be put in cache
      What is the overall speedup using the cache?
      $$ Speedup = \frac{1}{(1 - \text{\% time cache can be used}) + \frac{\text{\% time cache can be used}}{\text{Speedup using cache}}} $$
      $$ Speedup = \frac{1}{(1 - 0.9) + \frac{0.9}{5}} = 3.6 $$

  15. Occam's Toothbrush
      • The simple case is usually the most frequent and the easiest to optimize!
      • Do simple, fast things in hardware and be sure the rest can be handled correctly in software

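  16. Metrics of Performance
      [Figure: system layers, each with its typical performance metric]

      | Layer                                      | Metric                                                                                              |
      |--------------------------------------------|-----------------------------------------------------------------------------------------------------|
      | Application                                | Answers per month; Operations per second                                                            |
      | Programming Language, Compiler, ISA        | (millions) of Instructions per second: MIPS; (millions) of (FP) operations per second: MFLOP/s      |
      | Datapath, Control                          | Megabytes per second                                                                                |
      | Function Units, Transistors, Wires, Pins   | Cycles per second (clock rate)                                                                      |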

  17. Cycles Per Instruction
      $$ \text{CPU time} = \text{CK cycles for a program} \times T_{CK} $$
      Average Cycles per Instruction:
      $$ CPI = \frac{\text{CK cycles for a program}}{\text{Instruction Count}} = \frac{\text{CPU time} \times \text{CK rate}}{\text{Instruction Count}} $$
      Instruction Frequency:
      $$ CPI = \sum_{i=1}^{n} CPI_i \times F_i \quad \text{where} \quad F_i = \frac{I_i}{\text{Instruction Count}} $$
      $$ \text{CPU time} = IC \times CPI \times T_{CK} = IC \times T_{CK} \times \sum_{i=1}^{n} CPI_i \times F_i $$
      NB: $CPI_i$ should be measured and not just derived from the CPU reference manual (it must include cache misses, etc.)

  18. Aspects of CPU Performance
      $$ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} $$

      |              | Instr. Cnt | CPI | Clock Rate |
      |--------------|------------|-----|------------|
      | Program      |            |     |            |
      | Compiler     |            |     |            |
      | Instr. Set   |            |     |            |
      | Organization |            |     |            |
      | Technology   |            |     |            |

  19. Aspects of CPU Performance
      $$ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} $$

      |              | Inst Count | CPI | Clock Rate |
      |--------------|------------|-----|------------|
      | Program      | X          |     |            |
      | Compiler     | X          | (X) |            |
      | Inst. Set    | X          | X   |            |
      | Organization |            | X   | X          |
      | Technology   |            |     | X          |

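  20. CPI Example
      Base Machine (A): COMPARE + BRANCH → 2 separate instructions

      | OP     | Freq. | Cycles | CPI(i) |
      |--------|-------|--------|--------|
      | BRANCH | 20%   | 2      | 0.4    |
      | COMP   | 20%   | 1      | 0.2    |
      | Others | 60%   | 1      | 0.6    |
      | Total  | 100%  |        | 1.2    |

      New Machine (B): COMPARE + BRANCH → 1 integrated instruction ($T_{CK}^{B} = 1.25\,T_{CK}^{A}$)

      | OP     | Freq. | Cycles |
      |--------|-------|--------|
      | BRANCH | ?     | 2      |
      | Others | ?     | 1      |
      | Total  | 100%  |        |

      Which machine is faster?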

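  21. CPI Example
      $$ \text{CPU time}_A = IC_A \times 1.2 \times T_{CK}^{A} $$
      Machine B:
      $$ IC_B = IC_A - 20\%\,IC_A = 0.8\,IC_A $$
      $$ \text{Branch freq.} = \frac{20\%\,IC_A}{80\%\,IC_A} = 25\% $$

      | OP     | Freq. | Cycles | CPI(i) |
      |--------|-------|--------|--------|
      | BRANCH | 25%   | 2      | 0.5    |
      | Others | 75%   | 1      | 0.75   |
      | Total  | 100%  |        | 1.25   |

      $$ \text{CPU time}_B = IC_B \times CPI_B \times 1.25 \times T_{CK}^{A} = 0.8\,IC_A \times 1.25 \times 1.25\,T_{CK}^{A} = 1.25 \times IC_A \times T_{CK}^{A} $$
      CPU A is FASTER than CPU B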
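
The comparison between the two machines can be reproduced with a short sketch. Times are expressed in units of IC_A × T_CK_A, and the helper `weighted_cpi` is an illustrative name, not part of the lecture.

```python
def weighted_cpi(mix):
    """Average CPI from (frequency, cycles) pairs whose frequencies sum to 1."""
    return sum(freq * cycles for freq, cycles in mix)


# Machine A: branch 20% at 2 cycles, compare 20% at 1, others 60% at 1.
cpi_a = weighted_cpi([(0.20, 2), (0.20, 1), (0.60, 1)])
time_a = 1.0 * cpi_a * 1.0  # IC_A * CPI_A * T_CK_A, normalized

# Machine B: the compares are folded into the branches, so 20% of the
# instructions disappear and the remaining frequencies are renormalized.
ic_b = 1.0 - 0.20
cpi_b = weighted_cpi([(0.20 / ic_b, 2), (0.60 / ic_b, 1)])
time_b = ic_b * cpi_b * 1.25  # clock cycle is 1.25x longer on B

print(f"CPI_A = {cpi_a:.2f}, time_A = {time_a:.2f}")
print(f"CPI_B = {cpi_b:.2f}, time_B = {time_b:.2f}")
print("A is faster" if time_a < time_b else "B is faster")
```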

  22. Marketing Metrics
      $$ MIPS = \frac{\text{Instruction Count}}{\text{Time} \times 10^6} = \frac{\text{Clock Rate}}{CPI \times 10^6} $$
      • Machines with different instruction sets?
      • Programs with different instruction mixes?
        – Dynamic frequency of instructions
      • Uncorrelated with performance
      $$ MFLOP/s = \frac{\text{FP Operations}}{\text{Time} \times 10^6} $$
      • Machine dependent
      • Often not where time is spent
      Normalized:
        – add, sub, compare, mult: 1
        – divide, sqrt: 4
        – exp, sin, . . . : 8
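
The two forms of the MIPS formula can be checked against each other with a sketch. All numbers below (instruction count, CPI, clock rate) are made-up illustrative values, not data from the lecture.

```python
def mips_from_time(instruction_count: float, time_seconds: float) -> float:
    """MIPS = Instruction Count / (Time * 10^6)."""
    return instruction_count / (time_seconds * 1e6)


def mips_from_clock(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = Clock Rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


# Hypothetical program and machine, chosen only for illustration.
instruction_count = 1e9
cpi = 1.5
clock_rate_hz = 100e6
time_seconds = instruction_count * cpi / clock_rate_hz  # CPU time = IC * CPI / clock

print(f"{mips_from_time(instruction_count, time_seconds):.2f}")
print(f"{mips_from_clock(clock_rate_hz, cpi):.2f}")
```

Both calls give the same rating because the instruction count cancels, which is also why MIPS alone says nothing about how much work each instruction does.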

  23. Cycles Per Instruction
      “Average Cycles per Instruction”
      $$ CPI = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}} $$
      $$ \text{CPU time} = \text{CycleTime} \times \sum_{i=1}^{n} CPI_i \times I_i $$
      “Instruction Frequency”
      $$ CPI = \sum_{i=1}^{n} CPI_i \times F_i \quad \text{where} \quad F_i = \frac{I_i}{\text{Instruction Count}} $$
      Invest Resources where time is Spent!

  24. Organizational Trade-offs
      [Figure: system layers, top to bottom: Application; Programming Language; Compiler; ISA; Datapath, Control; Function Units; Transistors, Wires, Pins. Annotated alongside with: Instruction Mix, CPI, Cycle Time]

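  25. Example: Calculating CPI
      Base Machine (Reg / Reg), Typical Mix

      | Op     | Freq | Cycles | CPI(i) | (% Time) |
      |--------|------|--------|--------|----------|
      | ALU    | 50%  | 1      | .5     | (33%)    |
      | Load   | 20%  | 2      | .4     | (27%)    |
      | Store  | 10%  | 2      | .2     | (13%)    |
      | Branch | 20%  | 2      | .4     | (27%)    |
      | Total  |      |        | 1.5    |          |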
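
The CPI(i) and % Time columns follow directly from the mix, as this sketch shows. The dictionary layout is an illustrative choice, with each entry holding (frequency, cycles).

```python
# Base machine instruction mix: op -> (frequency, cycles per instruction)
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

# CPI is the frequency-weighted sum of the per-class cycle counts.
cpi = sum(freq * cycles for freq, cycles in mix.values())

# Each class's share of execution time is its CPI contribution over the total.
time_share = {op: freq * cycles / cpi for op, (freq, cycles) in mix.items()}

for op, (freq, cycles) in mix.items():
    print(f"{op:<7}{freq * cycles:.1f}  {time_share[op]:.0%}")
print(f"CPI = {cpi:.1f}")
```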

  26. Example
      Add register / memory operations:
        – One source operand in memory
        – One source operand in register
        – Cycle count of 2
      Branch cycle count to increase to 3.
      What fraction of the loads (in the base machine) must be eliminated for this to pay off?

      Base Machine (Reg / Reg), Typical Mix

      | Op     | Freq | Cycles |
      |--------|------|--------|
      | ALU    | 50%  | 1      |
      | Load   | 20%  | 2      |
      | Store  | 10%  | 2      |
      | Branch | 20%  | 2      |

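  27. Example Solution
      Exec Time = Instr Cnt x CPI x Clock

      | Op      | Freq | Cycles | Freq x Cycles |
      |---------|------|--------|---------------|
      | ALU     | .50  | 1      | .5            |
      | Load    | .20  | 2      | .4            |
      | Store   | .10  | 2      | .2            |
      | Branch  | .20  | 2      | .4            |
      | Reg/Mem |      |        |               |
      | Total   | 1.00 |        | 1.5           |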

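  28. Example Solution
      Exec Time = Instr Cnt x CPI x Clock

      | Op      | Freq | Cycles | Freq x Cycles | New Freq | New Cycles | New Freq x Cycles |
      |---------|------|--------|---------------|----------|------------|-------------------|
      | ALU     | .50  | 1      | .5            | .5 - X   | 1          | .5 - X            |
      | Load    | .20  | 2      | .4            | .2 - X   | 2          | .4 - 2X           |
      | Store   | .10  | 2      | .2            | .1       | 2          | .2                |
      | Branch  | .20  | 2      | .4            | .2       | 3          | .6                |
      | Reg/Mem |      |        |               | X        | 2          | 2X                |
      | Total   | 1.00 |        | 1.5           | 1 - X    |            | (1.7 - X)/(1 - X) |

      Here 1.7 - X is Cycles New and 1 - X is Instructions New. CPI New must be normalized to the new instruction frequency.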

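  29. Example Solution
      Exec Time = Instr Cnt x CPI x Clock

      | Op      | Freq | Cycles | Freq x Cycles | New Freq | New Cycles | New Freq x Cycles |
      |---------|------|--------|---------------|----------|------------|-------------------|
      | ALU     | .50  | 1      | .5            | .5 - X   | 1          | .5 - X            |
      | Load    | .20  | 2      | .4            | .2 - X   | 2          | .4 - 2X           |
      | Store   | .10  | 2      | .2            | .1       | 2          | .2                |
      | Branch  | .20  | 2      | .4            | .2       | 3          | .6                |
      | Reg/Mem |      |        |               | X        | 2          | 2X                |
      | Total   | 1.00 |        | 1.5           | 1 - X    |            | (1.7 - X)/(1 - X) |

      $$ \text{Instr Cnt}_{Old} \times CPI_{Old} \times Clock_{Old} = \text{Instr Cnt}_{New} \times CPI_{New} \times Clock_{New} $$
      $$ 1.00 \times 1.5 = (1 - X) \times \frac{1.7 - X}{1 - X} $$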

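  30. Example Solution
      Exec Time = Instr Cnt x CPI x Clock

      | Op      | Freq | Cycles | Freq x Cycles | New Freq | New Cycles | New Freq x Cycles |
      |---------|------|--------|---------------|----------|------------|-------------------|
      | ALU     | .50  | 1      | .5            | .5 - X   | 1          | .5 - X            |
      | Load    | .20  | 2      | .4            | .2 - X   | 2          | .4 - 2X           |
      | Store   | .10  | 2      | .2            | .1       | 2          | .2                |
      | Branch  | .20  | 2      | .4            | .2       | 3          | .6                |
      | Reg/Mem |      |        |               | X        | 2          | 2X                |
      | Total   | 1.00 |        | 1.5           | 1 - X    |            | (1.7 - X)/(1 - X) |

      $$ \text{Instr Cnt}_{Old} \times CPI_{Old} \times Clock_{Old} = \text{Instr Cnt}_{New} \times CPI_{New} \times Clock_{New} $$
      $$ 1.00 \times 1.5 = (1 - X) \times \frac{1.7 - X}{1 - X} $$
      $$ 1.5 = 1.7 - X $$
      $$ 0.2 = X $$
      ALL loads must be eliminated for this to be a win!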
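
The break-even point can be recomputed exactly with rational arithmetic. This is a sketch under two assumptions: the clock is unchanged, as in the equation on the slide, and each register/memory operation replaces one ALU operation and one load. The function name `new_cycles` is illustrative.

```python
from fractions import Fraction


def new_cycles(x: Fraction) -> Fraction:
    """Total cycles per original instruction when x reg/mem ops are introduced."""
    alu = (Fraction(1, 2) - x) * 1    # each reg/mem op removes one ALU op
    load = (Fraction(1, 5) - x) * 2   # ... and one load
    store = Fraction(1, 10) * 2
    branch = Fraction(1, 5) * 3       # branches now take 3 cycles
    reg_mem = x * 2
    return alu + load + store + branch + reg_mem


old_cycles = Fraction(3, 2)           # base machine: 1.00 x CPI of 1.5

# new_cycles is linear in x, so two sample points give slope and intercept.
intercept = new_cycles(Fraction(0))
slope = new_cycles(Fraction(1)) - intercept
break_even = (old_cycles - intercept) / slope

load_frequency = Fraction(1, 5)
print("X =", break_even)                                 # X = 1/5
print("fraction of loads:", break_even / load_frequency) # fraction of loads: 1
```

Because X equals the entire load frequency, the new machine only breaks even when every load is replaced.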
