CS305 Computer Architecture Fall 2009 Lecture 04 Bhaskaran Raman Department of CSE, IIT Bombay http://www.cse.iitb.ac.in/~br/ http://www.cse.iitb.ac.in/synerg/doku.php?id=public:courses:cs305-fall09:start
Today's Topics ● Performance metrics, CPI ● Performance comparison ● Benchmarks
Performance Comparison ● What performance metric to use? ● User cares about response time ● Performance is inversely proportional to execution time ● What is execution time? ● Response time ● CPU time: User time + System time ● System performance vs. CPU performance ● Throughput vs. response time ● We will focus on CPU performance
Which Program's Execution Time? ● Real “workload” is ideal ● Practical options: ● Real programs: compilers, office-suite, scientific... ● Kernels: key pieces of programs – Example: Livermore loops ● Toy benchmarks: small programs – Examples: Quick-sort, tower of Hanoi... ● Synthetic benchmarks: try to capture “average” frequency of instructions in real programs – Example: Whetstone, Dhrystone
More on Performance Comparisons... ● Caveat of benchmarks ● They are needed ● But manufacturers tend to optimize for benchmarks ● Need to be updated periodically ● Benchmark suite: collection of programs ● E.g. SPEC2000 ● Reporting performance ● Reproducibility: program version, compiler, flags ● SPEC specifies compiler flags for baseline comparison
Some Numerics...

                     Computer A   Computer B   Computer C
Program P1 (secs)             1           10           20
Program P2 (secs)          1000          100           20
Total (secs)               1001          110           40

● Total (or average) execution time is a possible metric ● Weighted execution time is better: $\sum_i W_i \times T_i$
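The difference between total and weighted execution time can be sketched in a few lines of Python. The times come from the table above; the weights (P1 making up 90% of the workload) are an assumption chosen only for illustration and are not part of the lecture.

```python
# Execution times in seconds, from the table above.
times = {
    "A": {"P1": 1, "P2": 1000},
    "B": {"P1": 10, "P2": 100},
    "C": {"P1": 20, "P2": 20},
}

# Hypothetical workload weights W_i (assumed, not from the lecture).
weights = {"P1": 0.9, "P2": 0.1}


def total_time(machine):
    """Unweighted total execution time of all programs on one machine."""
    return sum(times[machine].values())


def weighted_time(machine, w):
    """Weighted execution time: sum over programs of W_i * T_i."""
    return sum(w[prog] * t for prog, t in times[machine].items())


for m in times:
    print(m, total_time(m), weighted_time(m, weights))
```

Under these assumed weights, B has the lowest weighted time even though C has the lowest total, which is why the choice of weights matters.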
Normalizing the Performance

        Normalized to A        Normalized to B        Normalized to C
        A     B      C         A      B     C         A       B      C
P1      1     10     20        0.1    1     2         0.05    0.5    1
P2      1     0.1    0.02      10     1     0.2       50      5      1
AM      1     5.05   10.01     5.05   1     1.1       25.03   2.75   1
GM      1     1      0.63      1      1     0.63      1.58    1.58   1

(AM = arithmetic mean, GM = geometric mean of the normalized times)

● Normalize such that all programs take the same time, on some machine ● Arithmetic mean predicts performance ● Geometric mean?
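The normalized table can be recomputed with a short script. This is a minimal sketch; the function names are illustrative and `math.prod` assumes Python 3.8 or later.

```python
import math

# Execution times in seconds as [P1, P2], from the earlier table.
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}


def normalized(machine, ref):
    """Times of `machine` divided by the times of the reference machine."""
    return [t / r for t, r in zip(times[machine], times[ref])]


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    return math.prod(xs) ** (1.0 / len(xs))


for ref in times:
    for m in times:
        n = normalized(m, ref)
        print(f"ref={ref} machine={m} "
              f"AM={arithmetic_mean(n):.2f} GM={geometric_mean(n):.2f}")
```

Note that the ranking by arithmetic mean changes with the reference machine, while the ratio between any two geometric means stays the same whichever machine is used for normalization.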
Summary ● Performance is inversely proportional to execution time ● We are concerned with CPU time of an unloaded machine ● Weighted execution time with weights from real workload is ideal ● Else, normalize w.r.t. one machine
Amdahl's Law ● Amdahl's law: $\text{Overall speedup} = \dfrac{1}{(1 - F) + \dfrac{F}{\text{Speedup}}}$, where $F$ is the fraction of execution time that is sped up and $1 - F$ is the part left unchanged ● Diminishing returns ● Limit on overall speedup: $1 / (1 - F)$ ● Corollary: make the common case fast
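The diminishing returns can be seen numerically. The sketch below assumes F = 0.5 purely for illustration; the function name `overall_speedup` is not from the lecture.

```python
def overall_speedup(f, s):
    """Amdahl's law: f is the enhanced fraction, s the speedup of that fraction."""
    return 1.0 / ((1.0 - f) + f / s)


f = 0.5  # assumed fraction, for illustration only
for s in (2, 10, 100, 1000):
    print(f"speedup of enhanced part = {s:5d}, overall = {overall_speedup(f, s):.3f}")

# Upper bound as s grows without limit.
print("limit:", 1.0 / (1.0 - f))
```

However large the speedup of the enhanced part, the overall speedup stays below 1 / (1 − F).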
Illustrating Amdahl's Law ● Example: implement faster memory, or faster ALU? ● Proposed memory speedup: 10x ● Proposed ALU speedup: 3x ● Depends on fraction of instructions – Suppose $F_{mem} = 0.2$, $F_{alu} = 0.5$, $F_{other} = 0.3$ ● Speedup with faster memory $= \dfrac{1}{0.8 + 0.2/10} = 1.22$ ● Speedup with faster ALU $= \dfrac{1}{0.5 + 0.5/3} = 1.5$
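Example continued... ● Fixing $F_{alu} = 0.5$, for what value of $F_{mem}$ is going for a faster memory better? ● $\dfrac{1}{(1 - F_{mem}) + F_{mem}/10} > 1.5 \Rightarrow 1 - 0.9\,F_{mem} < \dfrac{2}{3} \Rightarrow F_{mem} > \dfrac{10}{27} \approx 0.37$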
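The break-even fraction can be checked by solving Amdahl's law for F. This is a small sketch; `break_even_fraction` is a hypothetical helper name introduced here.

```python
def overall_speedup(f, s):
    """Amdahl's law: f is the enhanced fraction, s the speedup of that fraction."""
    return 1.0 / ((1.0 - f) + f / s)


def break_even_fraction(target, s):
    """Fraction f at which enhancing it by s yields overall speedup `target`.

    From 1 / (1 - f + f/s) = target it follows that
    f = (1 - 1/target) / (1 - 1/s).
    """
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / s)


alu = overall_speedup(0.5, 3)    # faster ALU with F_alu = 0.5
mem = overall_speedup(0.2, 10)   # faster memory with F_mem = 0.2
print(f"ALU: {alu:.2f}, memory: {mem:.2f}")
print(f"memory wins when F_mem > {break_even_fraction(alu, 10):.3f}")
```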
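The CPU Performance Equation ● $\text{CPU time} = \text{Num. of clock cycles} \times \text{Clock cycle time}$ ● OR $\text{CPU time} = \dfrac{\text{Num. of clock cycles}}{\text{Clock rate}}$ ● For a program, $\text{Num. of clock cycles} = \text{Instruction Count} \times \text{Cycles Per Instruction} = IC \times CPI$ ● Putting these together: $\text{CPU time} = IC \times CPI \times \text{Clock cycle time}$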
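More on the Equation ● This form is convenient ● Involves many relevant parameters ● Remembering is easy: $\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Seconds}}{\text{Clock cycle}} \times \dfrac{\text{Clock cycles}}{\text{Instruction}} \times \dfrac{\text{Instructions}}{\text{Program}}$ ● With CPI as the subject of the equation: $CPI = \dfrac{\text{CPU time}}{\text{Clock cycle time} \times IC}$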
Other Convenient Forms of the Equation ● Number of clock cycles can be counted as: $\text{CPU clock cycles} = \sum_{i=1}^{n} CPI_i \times IC_i$ ● Hence, $\text{CPU time} = \sum_{i=1}^{n} CPI_i \times IC_i \times \text{Clock cycle time}$ ● Calculating in terms of CPI: $CPI = \dfrac{\text{CPU time}}{\text{Clock cycle time} \times IC} = \sum_{i=1}^{n} CPI_i \times \dfrac{IC_i}{IC}$
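The two forms of the equation can be checked against each other. In this sketch the instruction mix, the instruction count, and the cycle time are all made-up values used only to exercise the formulas.

```python
# Hypothetical instruction mix: type -> (fraction IC_i / IC, CPI_i).
mix = {"alu": (0.5, 1), "load_store": (0.3, 2), "branch": (0.2, 3)}

IC = 1_000_000_000   # assumed instruction count
CYCLE_TIME = 1e-9    # assumed clock cycle time in seconds (1 GHz clock)


def average_cpi(mix):
    """CPI = sum over types of CPI_i * (IC_i / IC)."""
    return sum(frac * cpi for frac, cpi in mix.values())


def cpu_time_from_cpi(ic, cpi, cycle_time):
    """CPU time = IC * CPI * clock cycle time."""
    return ic * cpi * cycle_time


def cpu_time_from_counts(mix, ic, cycle_time):
    """CPU time = sum over types of CPI_i * IC_i * clock cycle time."""
    return sum(cpi * (frac * ic) * cycle_time for frac, cpi in mix.values())


cpi = average_cpi(mix)
print("average CPI:", cpi)
print("CPU time (s):", cpu_time_from_cpi(IC, cpi, CYCLE_TIME))
print("CPU time (s):", cpu_time_from_counts(mix, IC, CYCLE_TIME))
```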
Usefulness of the Equation ● The fraction $F_i = IC_i / IC$ is easier to measure than $IC_i$ ● Equivalently, $IC_i$ is measured through $F_i$ ● Equation includes relevant parameters such as the cycle time
Measuring the Parameters for the Equation ● Clock cycle time: ● Easy for existing architectures ● Needs to be estimated in the design process ● Instruction Count: ● Requires a compiler ● And, simulator/interpreter, or instrumentation code ● CPI for each instruction type: ● Easy for simple architectures ● Pipelines, caches introduce complications ● Need to simulate and measure average CPI
A Design Example ● A design choice for conditional branch instructions: ● Choice 1: condition code is set by a compare instruction, checked by the next (branch) instruction – 20% of instructions are branches, and another 20% are compares – 2 cycles per branch, 1 cycle for all others – Clock rate is 25% faster than in Choice 2 ● Choice 2: single instruction for compare and branch ● Which choice is better?
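Solution for Design Example ● Let $IC_1$ be the instruction count under Choice 1 and $C$ the clock cycle time under Choice 2, so the cycle time under Choice 1 is $C / 1.25$ ● $\text{CPU time}_1 = IC_1 \times [0.8 \times 1 + 0.2 \times 2] \times \dfrac{C}{1.25} = IC_1 \times 1.2 \times \dfrac{C}{1.25} = 0.96 \times IC_1 \times C$ ● Choice 2 removes the compares, leaving $0.6 \times IC_1$ one-cycle instructions and $0.2 \times IC_1$ two-cycle branches: $\text{CPU time}_2 = IC_1 \times [0.6 \times 1 + 0.2 \times 2] \times C = IC_1 \times C$ ● Since $0.96 \times IC_1 \times C < IC_1 \times C$, Choice 1 is faster under these assumptions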
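The arithmetic of the design example can be reproduced as below. This is a sketch that normalizes IC_1 and the Choice 2 cycle time C to 1, and assumes (as the solution does) that a combined compare-and-branch still takes 2 cycles.

```python
def cpu_time(cycles_per_ic1, cycle_time, ic1=1.0):
    """CPU time = IC_1 * (clock cycles per Choice-1 instruction) * cycle time."""
    return ic1 * cycles_per_ic1 * cycle_time


C = 1.0  # Choice 2 clock cycle time, normalized

# Choice 1: 80% one-cycle instructions, 20% two-cycle branches, 25% faster clock.
time1 = cpu_time(0.8 * 1 + 0.2 * 2, C / 1.25)

# Choice 2: compares (20% of IC_1) disappear; branches still take 2 cycles.
time2 = cpu_time(0.6 * 1 + 0.2 * 2, C)

print(f"Choice 1: {time1:.2f} x IC1 x C")
print(f"Choice 2: {time2:.2f} x IC1 x C")
print("Faster:", "Choice 1" if time1 < time2 else "Choice 2")
```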