2110412 Parallel Comp Arch
Performance and Benchmarking
Natawut Nupairoj, Ph.D.
Department of Computer Engineering, Chulalongkorn University

Performance Questions
- How do we characterize the performance of applications and systems?
- What are the user's requirements in performance and cost?
- How do we measure performance?
- How will the system perform when given more resources or more workload?

Important Keywords
- To evaluate performance correctly, we must consider:
  - What is the metric (or metrics)?
  - What is its definition?
  - How do we measure it? With what benchmark algorithm?
  - What is the evaluating environment?
    - Configuration.
    - Workload.

Performance Metrics
- Indicators of how good a system is.
- Peak performance
  - Theoretical performance.
  - Typically, the peak of a single CPU times n.
- Sustained performance
  - The maximal achievable performance when running a benchmark.
Popular Metrics
- Time: execution time.
- Rate: throughput and processing speed.
- Resource: utilization.
- Ratio: cost effectiveness.
- Reliability: error rate.
- Availability: mean time to failure (MTTF).

Execution Time
- Aka. wall-clock time, elapsed time, or delay.
- CPU time + I/O time + user time + ...
- The lower, the better.
- Factors:
  - Algorithm.
  - Data structure.
  - Input.
  - Hardware/software/OS.
  - Language.

Definition and Analysis of Time
- Let's try the "time" command on Unix:
    90.7u 12.9s 2:39 65%
- User time = 90.7 secs.
- System time = 12.9 secs.
- Elapsed time = 2 mins 39 secs = 159 secs.
- (90.7 + 12.9) / 159 = 65%.
- Meaning? The process kept the CPU busy for only 65% of the elapsed time; the rest went to I/O waits and other processes.
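The 65% figure is just (user + system) CPU time divided by elapsed time. A minimal sketch of that arithmetic, using the sample numbers from the slide's "time" output:

```python
# CPU utilization as reported by the Unix "time" command:
# (user + system) / elapsed. Values are the slide's sample run.
user_s = 90.7             # user CPU time (seconds)
sys_s = 12.9              # system CPU time (seconds)
elapsed_s = 2 * 60 + 39   # 2:39 wall-clock time = 159 seconds

cpu_share = (user_s + sys_s) / elapsed_s
print(f"{cpu_share:.0%}")  # 65%
```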
Processing Speed
- How fast can the system execute?
- Measured in MIPS, MFLOPS.
- The more, the better.
- Can be very misleading!

Throughput
- The number of jobs that can be processed per unit time.
- Aka. bandwidth (in communication).
- The more, the better.
- High throughput does not necessarily mean low execution time:
  - Pipelining.
  - Multiple execution units.
- Example: unrolling a loop lets multiple execution units work at once:

    for j = 0 to x          for j = 0 to x/4
        k = m + n;              k = m + n;
                                k = m + n;
                                k = m + n;
                                k = m + n;

Utilization
- The percentage of resources being used.
- The ratio of:
  - busy time vs. total time, or
  - sustained speed vs. peak speed.
- The more, the better?
  - True for the manager.
  - But maybe not for the user/customer.
- The resource with the highest utilization is the "bottleneck".

Cost Effectiveness
- Peak performance/cost ratio.
- Price/performance ratio.
- PCs are much better in this category than supercomputers.
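To see why high throughput need not mean low execution time, consider a hypothetical pipeline: each job still takes the same number of time units end to end, but once the pipeline is full, one job completes every unit. A sketch with made-up numbers (4 stages, 100 jobs):

```python
# Hypothetical pipeline: 4 stages, one time unit per stage.
stages = 4
jobs = 100

# Without pipelining, jobs run back to back.
serial_time = jobs * stages            # 400 time units

# With pipelining, fill the pipe once, then one job finishes per unit.
pipelined_time = stages + (jobs - 1)   # 103 time units

# Per-job latency is unchanged (still 4 units), but throughput rises.
print(serial_time, pipelined_time)
print(jobs / serial_time, jobs / pipelined_time)  # jobs per time unit
```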
Price/Performance Ratio
- (Chart: from Tom's Hardware Guide, CPU Chart 2009.)

Moore's Law (1965)
- (Chart: Kurzweil, "The Law of Accelerating Returns".)

Performance of Parallel Systems
- Factors:
  - Components and architecture.
  - Degree of parallelism.
  - Overheads.
- Architecture:
  - CPU speed.
  - Memory size and speed.
  - Memory hierarchy.
Parallelism and Overheads
- Execution time: T = Tpar + Tseq + Tcomm
- Tpar: time spent in parallel.
  - All nodes execute at the same time.
  - Mostly computation time.
  - Depends on the algorithm.
  - Load imbalance (degree of parallelism).
- Tseq: time spent in sequential code.
  - Only one node (usually the master) does the job.
  - Loading/saving data from disk.
  - Critical sections.
  - Usually occurs at the start and end of the program.
- Tcomm: communication overhead.
  - Communication between nodes.
  - Data movement.
  - Synchronization: barriers, locks, and critical regions.
  - Aggregation: reductions.

Speedup Analysis
- How good is the parallel system compared to the sequential system?
- Used to predict scalability.
- Speedup metrics:
  - Amdahl's Law.
  - Gustafson's Law.

Execution Time Components
- Given a program with workload W:
  - Let α be the fraction of the SEQUENTIAL portion of this program.
  - The parallel portion is then 1 - α.
  - W = αW + (1 - α)W
Execution Time Components (2)
- Suppose this program requires T time units on a SINGLE processor:
  - Tseq = αT
  - Tpar = (1 - α)T
  - T = αT + (1 - α)T
  - For simplicity, ignore Tcomm.

Speedup Formula
- Speedup = sequential execution time / parallel execution time.

Amdahl's Law
- Aka. fixed-load (fixed-problem) speedup.
- Given workload W, how good is the system if we have n processors (ignoring communication)?
- Sn = (time to execute W on 1 processor) / (time to execute W on n processors)
- Sn = T / (αT + (1 - α)T/n) = 1 / (α + (1 - α)/n)
- Sn → 1/α as n → ∞.
- Very popular (and also pessimistic).
- (Timeline figure: the serial part αT stays fixed while the parallel part (1 - α)T shrinks as the number of processors grows.)
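The fixed-load formula above can be sketched as a small helper; `alpha` is the sequential fraction as defined on the slide, and the function name is just illustrative:

```python
def amdahl_speedup(alpha, n):
    """Fixed-load speedup: Sn = 1 / (alpha + (1 - alpha) / n)."""
    return 1.0 / (alpha + (1.0 - alpha) / n)

# Speedup grows with n but is capped by 1/alpha.
print(round(amdahl_speedup(0.1, 10), 2))    # 5.26
print(round(amdahl_speedup(0.1, 1000), 2))  # 9.91, approaching 1/0.1 = 10
```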
Impact of the Parallel Portion (1 - α)
- (Chart: speedup curves for increasing parallel fractions.)

Example 1
- 95% of a program's execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?

Example 2
- 20% of a program's execution time is spent within inherently sequential code. What is the limit to the speedup achievable by a parallel version of the program?

Limitations of Amdahl's Law
- Ignores Tcomm, and therefore overestimates the achievable speedup.
- Very pessimistic:
  - When people have bigger machines, they run bigger programs.
  - Thus, when people have more processors, they usually run bigger workloads.
  - More workload means a larger parallel portion.
  - The workload may not be fixed; it may SCALE.
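The two exercises above can be checked with the same fixed-load formula: a 95% parallel loop means alpha = 0.05, and the limiting speedup as n grows without bound is 1/alpha.

```python
def amdahl_speedup(alpha, n):
    # alpha = sequential fraction, n = number of processors
    return 1.0 / (alpha + (1.0 - alpha) / n)

# Example 1: 95% parallel loop on 8 CPUs -> alpha = 0.05.
print(round(amdahl_speedup(0.05, 8), 2))  # 5.93

# Example 2: alpha = 0.2; the limit as n -> infinity is 1/alpha.
print(1.0 / 0.2)  # 5.0
```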
Problem Size and Amdahl's Law
- (Chart: speedup vs. number of processors for problem sizes n = 100, 1,000, and 10,000; the example application is weather prediction.)

Gustafson's Law
- Aka. fixed-time speedup (or scaled-load speedup).
- Given a workload W, suppose it takes time T to execute W on 1 processor.
- Within the same time T, how much workload can we run on n processors? Call it W'.
- Assume the sequential work remains constant:
  - W = αW + (1 - α)W
  - W' = αW + (1 - α)nW

Gustafson's Law (2)
- Fixed-time speedup:
- S'n = (workload that can be executed in time T with n processors) / (workload that can be executed in time T with 1 processor)
- S'n = W'/W = (αW + (1 - α)nW) / W = α + (1 - α)n
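Gustafson's scaled speedup, S'n = α + (1 - α)n, can be sketched the same way; again the function name is just illustrative:

```python
def gustafson_speedup(alpha, n):
    """Fixed-time (scaled) speedup: S'n = alpha + (1 - alpha) * n."""
    return alpha + (1.0 - alpha) * n

# Unlike Amdahl's law, the scaled speedup grows linearly in n.
print(round(gustafson_speedup(0.1, 10), 2))   # 9.1
print(round(gustafson_speedup(0.1, 100), 2))  # 90.1
```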
Gustafson's Law (3)
- (Timeline figure: the serial part αW plus the parallel part (1 - α)nW, distributed across n processors.)

Example 1
- An application running on 10 processors spends 3% of its time in serial code. What is the scaled speedup of the application?

Example 2
- What is the maximum fraction of a program's parallel execution time that can be spent in serial code if it is to achieve a scaled speedup of 7 on 8 processors?

Performance Benchmarking
- Benchmark:
  - Measures and predicts the performance of a system.
  - Reveals its strengths and weaknesses.
- Benchmark suite: a set of benchmark programs together with testing conditions and procedures.
- Benchmark family: a set of benchmark suites.
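Plugging the two exercises above into the scaled-speedup formula, and solving it for α in the second case:

```python
def gustafson_speedup(alpha, n):
    # alpha = sequential fraction, n = number of processors
    return alpha + (1.0 - alpha) * n

# Example 1: 3% serial code on 10 processors.
print(round(gustafson_speedup(0.03, 10), 2))  # 9.73

# Example 2: solve alpha + (1 - alpha) * 8 = 7 -> alpha = (8 - 7) / (8 - 1).
alpha = (8 - 7) / (8 - 1)
print(round(alpha, 3))  # 0.143
```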
Benchmark Classification
- By instructions:
  - Full application.
  - Kernel: a set of frequently used functions.
- By workload:
  - Real programs.
  - Synthetic programs.

Popular Benchmark Suites
- SPEC
- TPC
- LINPACK

SPEC
- Standard Performance Evaluation Corporation.
- Uses real applications.
- http://www.spec.org
- SPEC CPU2006 measures CPU performance:
  - Raw speed of completing a single task.
  - Rate of processing many tasks.
  - CINT2006: integer performance.
  - CFP2006: floating-point performance.

SPEC CINT2006
  400.perlbench    C    PERL Programming Language
  401.bzip2        C    Compression
  403.gcc          C    C Compiler
  429.mcf          C    Combinatorial Optimization
  445.gobmk        C    Artificial Intelligence: Go
  456.hmmer        C    Search Gene Sequence
  458.sjeng        C    Artificial Intelligence: Chess
  462.libquantum   C    Physics: Quantum Computing
  464.h264ref      C    Video Compression
  471.omnetpp      C++  Discrete Event Simulation
  473.astar        C++  Path-finding Algorithms
  483.xalancbmk    C++  XML Processing