Lecture 2: Performance Evaluation
Performance definition, benchmark, summarizing performance, Amdahl’s law, and CPI

What Does Performance Mean?
• Response time
  – A simulation program finishes in 5 minutes
• Throughput
  – A web server serves 5 million requests per second
• Other metrics
  – MIPS (million instructions per second)
  – MFLOPS
  – Clock frequency

Execution Time
• Processor design is concerned with the processor time consumed by program execution. Shorter execution time =>
  – Shorter response time
  – Higher throughput
• Execution time = #inst × CPI × Cycle time
  – What affects #inst, CPI, and cycle time?
  – Almost all designs can be interpreted in terms of these factors
• Any other metric is meaningful only if it is consistent with execution time

Performance of Computers
Performance is defined for a program and a machine.
How to compare computers? Need benchmark programs:
  – Real applications: scientific programs, compilers, text-processing software, image processing
  – Modified applications: providing portability and focus
  – Kernels: good to isolate performance of individual features
    • Lmbench: measures latency and bandwidth of memory, file system, networking, etc.
  – Toy benchmarks
  – Synthetic benchmarks: matching average execution profile

Performance Comparison
“X is n times faster than Y”:

$$\frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y} = n$$

• n: speedup if we are considering an enhancement, optimization, etc.
• What does “improving” mean?
  – Improve performance: decrease execution time, increase throughput
  – Improve execution time: decrease execution time
  – Degrade performance: the reverse of the above; gives a speedup below 1 (a slowdown)

Benchmark Suite
• A benchmark suite is a collection of benchmarks with a variety of applications
  – Alleviates the weakness of a single benchmark
  – More representative for computer designers to evaluate their design
  – Benchmarks test both the computer and compilers, and the OS in many cases
• Desktop benchmarks: CPU, memory, and graphics performance
• Server benchmarks: throughput-oriented, I/O and OS intensive
• Embedded benchmarks: measuring the ability to meet deadlines and save power
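As a minimal sketch of the execution-time equation (#inst × CPI × cycle time) and the “X is n times faster than Y” definition, the Python below computes both for two machines. The machine parameters are made-up illustrative numbers, and the function names `execution_time` and `times_faster` are hypothetical helpers, not part of the lecture.

```python
def execution_time(instruction_count, cpi, cycle_time_s):
    """Execution time = #inst x CPI x cycle time, in seconds."""
    return instruction_count * cpi * cycle_time_s


def times_faster(time_y_s, time_x_s):
    """Return n such that X is n times faster than Y.

    n = Execution time of Y / Execution time of X.
    A value below 1 means X is actually slower than Y.
    """
    return time_y_s / time_x_s


if __name__ == "__main__":
    # Hypothetical machines running the same program (numbers are made up).
    time_y = execution_time(2_000_000_000, cpi=2.0, cycle_time_s=1.0e-9)
    time_x = execution_time(2_000_000_000, cpi=1.6, cycle_time_s=0.8e-9)
    n = times_faster(time_y, time_x)
    print(f"Y: {time_y:.2f} s, X: {time_x:.2f} s, X is {n:.2f}x faster than Y")
```

Because the three factors multiply, lowering CPI and lowering cycle time compound: in this made-up case a 20% reduction in each gives more than a 1.5x speedup.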
Summarizing Performance
Given the performance of a set of programs, how to evaluate the performance of machines?

                 A       B       C
P1 (secs)        1       10      20
P2 (secs)        1000    100     20
Total (secs)     1001    110     40

• Which computer is the “best” one?

Arithmetic Mean
• Total execution time / (number of programs)

$$\frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$

  – Simple and intuitive
  – Representative if the user runs the programs an equal number of times

Weighted Arithmetic Mean
• Give (different) weights to different programs

$$\sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i, \qquad \sum_{i=1}^{n} \text{Weight}_i = 1$$

  – Considering the frequencies of programs in the workload

Geometric Means
• Based on relative performance to a reference machine

$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$

• Relative performance is consistent with different reference machines

$$\frac{\text{Geometric mean}(X_i)}{\text{Geometric mean}(Y_i)} = \text{Geometric mean}\left(\frac{X_i}{Y_i}\right)$$

  – If C is 2x faster than B (using B as the reference), B is 2x faster than A (A as the reference), then C is 4x faster than A (A as the reference)

Harmonic Mean
• Given speedups $s_1, s_2, \ldots, s_n$, the average speedup by harmonic mean is

$$\frac{n}{\frac{1}{s_1} + \frac{1}{s_2} + \cdots + \frac{1}{s_n}}$$

Why not arithmetic mean?

Amdahl’s Law
We know about performance: defining, measuring, and summarizing.
How to maximize performance gains from the beginning in our design?
• Principle: Make the Common Case Fast!
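The four means can be tried on the P1/P2 execution times given for machines A, B, and C. The sketch below is illustrative only: the function names are hypothetical, the 0.9/0.1 workload weights are an assumption chosen for the example (the slides give no weights), and `math.prod` requires Python 3.8 or later.

```python
import math

# Execution times in seconds from the Summarizing Performance table.
TIMES = {
    "A": [1.0, 1000.0],   # P1, P2
    "B": [10.0, 100.0],
    "C": [20.0, 20.0],
}


def arithmetic_mean(times):
    """Total execution time divided by the number of programs."""
    return sum(times) / len(times)


def weighted_arithmetic_mean(times, weights):
    """Sum of Weight_i x Time_i, where the weights must sum to 1."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * t for w, t in zip(weights, times))


def geometric_mean(ratios):
    """n-th root of the product of n execution-time ratios."""
    return math.prod(ratios) ** (1.0 / len(ratios))


def harmonic_mean(speedups):
    """n / (1/s_1 + ... + 1/s_n)."""
    return len(speedups) / sum(1.0 / s for s in speedups)


if __name__ == "__main__":
    weights = [0.9, 0.1]  # assumed workload: P1 runs 90% of the time
    reference = TIMES["A"]  # normalize execution times to machine A
    for name, times in TIMES.items():
        ratios = [t / r for t, r in zip(times, reference)]
        print(
            f"{name}: arithmetic={arithmetic_mean(times):.1f} s, "
            f"weighted={weighted_arithmetic_mean(times, weights):.1f} s, "
            f"geometric (vs A)={geometric_mean(ratios):.3f}"
        )
```

The choice of summary changes the ranking: by total or arithmetic mean time C is best, while under the assumed 0.9/0.1 weights B edges out C, and by geometric mean A and B tie. That sensitivity is why the workload assumption behind a summary should be stated.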
Amdahl’s Law
• Predict overall speedup from the “local speedup” of an enhancement, provided the fraction of time the enhancement is used is known.
  – “Local speedup” is related to design and optimization objectives, such as doubling the CPU frequency or halving the cache latency

Amdahl’s Law

$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$$

$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

Amdahl’s Law
Assume we need to improve the performance of a graphics engine.
Choice one: speed up FP square root by 10x
Choice two: speed up all FP instructions by 1.6x
Assume 20% of instructions are FP square root and 50% are FP instructions overall.
Which choice is better?
Implication: optimize for the common case first.

Equation Based on Instruction Types

$$\text{CPU time} = \text{CPU clock cycles} \times \text{Clock cycle time}$$

$$\text{CPU clock cycles} = \sum_{i=1}^{n} IC_i \times CPI_i$$

$$\Rightarrow \text{CPU time} = \left( \sum_{i=1}^{n} IC_i \times CPI_i \right) \times \text{Clock cycle time}$$

$$CPI = \sum_{i=1}^{n} \text{Instruction frequency}_i \times CPI_i$$

Make Design Choice Using CPU Time Equation

              FP      FPSQR   Other
Frequency     25%     2%      75%
CPI           4.0     20      1.33

Alternative 1: CPI of FPSQR 20 → 2
Alternative 2: CPI of FP 4 → 2.5
Which one is better? Calculate speedups.

SPEC CPU Benchmark
• SPEC: Standard Performance Evaluation Corporation
• CPU-intensive benchmark for evaluating processor performance of workstations
• Four generations: SPEC89, SPEC92, SPEC95, and SPEC2000
• Two types of programs: INT and FP
• Emphasizing memory system performance in SPEC2000
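The two design questions posed on these slides (the graphics-engine choice and the FP versus FPSQR alternatives) can be worked through numerically. The following is a sketch, not the lecture’s own solution: `amdahl_speedup` and `average_cpi` are hypothetical helper names, and it assumes that FPSQR instructions are counted inside the 25% FP category (which the 25% + 75% = 100% split suggests) and that instruction count and clock cycle time stay unchanged, so speedup equals the ratio of CPIs.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def average_cpi(mix):
    """CPI = sum of Instruction frequency_i x CPI_i.

    mix is a list of (frequency, cpi) pairs whose frequencies sum to 1.
    """
    return sum(freq * cpi for freq, cpi in mix)


if __name__ == "__main__":
    # Graphics engine example: 20% FP square root, 50% all FP.
    choice_one = amdahl_speedup(0.20, 10.0)  # about 1.22
    choice_two = amdahl_speedup(0.50, 1.6)   # about 1.23
    print(f"Choice one: {choice_one:.3f}, choice two: {choice_two:.3f}")

    # CPU time equation example. Assumption: FPSQR (2%, CPI 20) is part of
    # the FP category (25%, average CPI 4.0); Other is 75% with CPI 1.33.
    cpi_original = average_cpi([(0.25, 4.0), (0.75, 1.33)])
    # Alternative 1: FPSQR CPI drops from 20 to 2, saving 18 cycles on 2%
    # of instructions.
    cpi_alt1 = cpi_original - 0.02 * (20.0 - 2.0)
    # Alternative 2: average FP CPI drops from 4.0 to 2.5.
    cpi_alt2 = average_cpi([(0.25, 2.5), (0.75, 1.33)])
    # With instruction count and cycle time fixed, speedup = CPI_old / CPI_new.
    print(f"Alternative 1 speedup: {cpi_original / cpi_alt1:.3f}")
    print(f"Alternative 2 speedup: {cpi_original / cpi_alt2:.3f}")
```

Under these assumptions, the broader but smaller improvement (all FP instructions) comes out slightly ahead of the large improvement to the rare FP square root in both examples, which is the “common case first” implication.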
SPEC CPU2000 Profiling
Dynamic instruction mix

Instruction      Int avg   FP avg
Load int         26%       15%
Store int        10%       2%
Load fp          -         15%
Store fp         -         7%
Add              19%       23%
All fp inst      -         41%
Cond br.         12%       4%
All ctrl inst    16%       4%

Other SPEC Benchmarks
• SPECviewperf and SPECapc: 3D graphics performance
• SPEC JVM98: performance of client-side Java virtual machine
• SPEC JBB2000: server-side Java application
• SPEC WEB99: evaluating WWW servers
• SPEC HPC96: parallel and distributed computing

Server Benchmarks
• SPEC CPU2000, WEB99, SFS97
• TPC: measuring the ability of a system to handle transactions
  – TPC-C: online transaction processing (OLTP) benchmark (for bank systems)
  – TPC-H: ad hoc decision support
  – TPC-R: decision support with standard queries
  – TPC-W: simulating a business-oriented transactional web server

Embedded Benchmark
• EEMBC (Embedded Microprocessor Benchmark Consortium) benchmarks
  – Based on kernel performance
  – Five classes: automotive/industrial, consumer, networking, office automation, and telecommunications
Embedded benchmarks are not mature