Using Processor Partitioning to Evaluate the Performance of MPI, OpenMP and Hybrid Parallel Applications on Dual- and Quad-core Cray XT4 Systems

Xingfu Wu <wuxf@cs.tamu.edu> and Valerie Taylor
Department of Computer Science & Engineering, Texas A&M University
http://prophesy.cs.tamu.edu

CUG2009, May 5, 2009, Atlanta, GA
Outline
- Introduction: Processor Partitioning
- Execution Platforms and Performance
- NAS Parallel Benchmarks (MPI, OpenMP)
- Gyrokinetic Toroidal code (GTC, hybrid)
- Performance Modeling Using Prophesy System
- Summary
Introduction
- Chip multiprocessors (CMP) are usually configured hierarchically to form a compute node of CMP cluster systems.
- One issue is how many processor cores per node to use for efficient execution.
- The best number of processor cores per node is dependent upon the application characteristics and system configurations.
Processor Partitioning
- Quantify the performance gap resulting from using different numbers of processor cores per node for application execution (we use the term processor partitioning for this).
- Understand how processor partitioning impacts system and application performance.
- Investigate how and why an application is sensitive to communication and memory access patterns.
Processor Partitioning Scheme
- A processor partitioning scheme NxM stands for N nodes with M processor cores per node (PPN).
- Using processor partitioning changes the memory access pattern and the communication pattern of an MPI program; a sketch for checking a given partitioning follows below.
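To make the NxM notation concrete, the following minimal C/MPI sketch (not from the original slides) prints which node each rank is placed on, so that a chosen partitioning can be verified after launch. MPI_Get_processor_name is standard MPI; how the M-ranks-per-node placement is requested (for example, a per-node PE count passed to the Cray launcher) is site-specific and assumed here.

    /* Hedged sketch: verify an NxM processor partitioning by reporting
     * the node each MPI rank runs on. Assumes the job was launched with
     * N*M ranks and M ranks per node via the site's launcher. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char node[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(node, &len);

        /* Counting distinct node names, and ranks per name, gives N and M. */
        printf("rank %d of %d runs on node %s\n", rank, size, node);

        MPI_Finalize();
        return 0;
    }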
Outline
- Introduction
- Execution Platforms and Performance
  - Memory Performance Analysis: STREAM benchmark
  - MPI Communication Performance Analysis: IMB benchmarks
- NAS Parallel Benchmarks (MPI, OpenMP)
- Gyrokinetic Toroidal code (GTC, hybrid)
- Performance Modeling Using Prophesy System
- Summary
Dual- and Quad-core Cray XT4 Configurations

                      Franklin           Jaguar
  Total Cores         19,320             31,328
  Total Nodes         9,660              7,832
  Cores/Chip          2                  4
  Cores/Node          2                  4
  CPU Type            2.6 GHz Opteron    2.1 GHz Opteron
  Memory/Node         4 GB               8 GB
  L1 Cache/CPU        64/64 KB           64/64 KB
  L2 Cache/Chip       1 MB               2 MB
  Network             3D Torus           3D Torus
STREAM Benchmark
- Synthetic benchmarks, written in Fortran 77 and MPI or in C and OpenMP
- Measure the sustainable memory bandwidth using the unit-stride TRIAD kernel a(i) = b(i) + q*c(i); a sketch of the kernel follows below
- The array size is 4M (2^22) elements
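For illustration only, here is a minimal C/OpenMP sketch of the TRIAD kernel; it is not the actual STREAM source. The array size of 2^22 doubles matches the slide, while the timing loop and the bandwidth formula (three arrays of doubles moved per iteration) follow the usual STREAM convention and are assumptions here.

    /* Hedged sketch of the STREAM TRIAD kernel (not the official STREAM code). */
    #include <stdio.h>
    #include <omp.h>

    #define N (1 << 22)                  /* 4M elements, as on the slide */

    int main(void)
    {
        static double a[N], b[N], c[N];
        const double q = 3.0;

        /* Initialize arrays (also touches pages before timing). */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) { a[i] = 0.0; b[i] = 2.0; c[i] = 1.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];      /* TRIAD: a(i) = b(i) + q*c(i) */
        double t1 = omp_get_wtime();

        /* TRIAD moves three arrays of doubles: read b, read c, write a. */
        double mbytes = 3.0 * sizeof(double) * N / 1.0e6;
        printf("TRIAD bandwidth: %.2f MB/s\n", mbytes / (t1 - t0));
        return 0;
    }

In the MPI version of STREAM, each rank presumably runs its own TRIAD loop, so fewer ranks per node means less contention for a node's memory controllers; this is consistent with the partitioning results on the next slide.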
Sustainable Memory Bandwidth

  Franklin
  Processor partitioning scheme   MPI: 1x2    MPI: 2x1    OpenMP: 2 threads
  Memory Bandwidth (MB/s)         4026.53     6710.89     3565.71

  Jaguar
  Processor partitioning scheme   MPI: 1x4    MPI: 2x2    MPI: 4x1    OpenMP: 4 threads
  Memory Bandwidth (MB/s)         5752.19     10066.33    10066.33    5606.77
Intel's MPI Benchmarks (IMB)
- Provides a concise set of benchmarks targeted at measuring the most important MPI functions
- Version 2.3, written in C and MPI
- Uses PingPong to measure uni-directional intra-node and inter-node latency and bandwidth; a sketch of the measurement idea follows below
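The sketch below shows the ping-pong measurement idea in C/MPI; it is not the IMB source. The message size and repetition count are illustrative assumptions. Latency is half the average round-trip time and bandwidth is the message size divided by the one-way time, as in IMB's PingPong. Whether intra-node or inter-node performance is measured depends on where the two ranks are placed (e.g., a 1x2 versus a 2x1 partitioning).

    /* Hedged ping-pong sketch (illustrative only, not IMB). Run with exactly
     * two ranks: rank 0 and rank 1 bounce a message back and forth. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int reps  = 1000;           /* assumed repetition count */
        const int bytes = 1 << 20;        /* assumed message size: 1 MB */
        char *buf = malloc(bytes);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way = (MPI_Wtime() - t0) / (2.0 * reps);   /* half the round trip */

        if (rank == 0)
            printf("latency %.2f us, bandwidth %.2f MB/s\n",
                   one_way * 1.0e6, bytes / one_way / 1.0e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }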
Uni-directional Latency and Bandwidth

[Figure: four PingPong plots comparing Franklin and Jaguar: uni-directional intra-node and inter-node latency (us, log scale) and intra-node and inter-node bandwidth (MB/s, log scale), each plotted against message size (bytes, log scale).]
Lessons Learned from STREAM and IMB
- Memory access patterns at different memory hierarchy levels affect sustainable memory bandwidth
- The fewer the PPN, the higher the sustainable memory bandwidth
- Using all cores per node does not result in the highest memory bandwidth
- Intra-node MPI latency is much lower, and bandwidth much higher, than inter-node
Outline
- Introduction: Processor Partitioning
- Execution Platforms and Performance
- NAS Parallel Benchmarks (MPI, OpenMP)
- Gyrokinetic Toroidal code (GTC, hybrid)
- Performance Modeling Using Prophesy System
- Summary
NAS Parallel Benchmarks
- NPB 3.2.1 (MPI and OpenMP)
- CG, EP, FT, IS, MG, LU, BT, SP
- Class B and C
- Compiled with ftn and the options -O3 -fastsse on Franklin and Jaguar
- Strong scaling