Speedup for Multi-Level Parallel Computing. Shanjiang Tang, Bu-Sung Lee, Bingsheng He. School of Computer Engineering, Nanyang Technological University. 21st May 2012
Outline • Background & Motivation • Multi-Level Parallel Speedup • Evaluation • Conclusion
Multi-Level Computing Architecture and Paradigm
Multi-Level Computing Architecture and Paradigm • MPI+OpenMP • MPI+CUDA • MPI+OpenMP+CUDA • …
Multi-Level Parallel Computing Model [Figure: tree of processing elements over levels L1, L2, L3, …, Lm. PE 1,1 sits at level 1, PE 2,1 and PE 2,2 at level 2, and PE 3,1 to PE 3,8 at level 3. The legend distinguishes the sequential part from the parallel part.]
Parallel Speedup
• Definition
$$\text{Speedup} = \frac{\text{SequentialExecutionTime}}{\text{ParallelExecutionTime}}$$
• Classification
– Absolute Speedup
$$\text{Speedup} = \frac{\text{BestSequentialALGExecutionTime}}{\text{ParallelALGExecutionTime}}$$
– Relative Speedup
$$\text{Speedup} = \frac{\text{ParallelALGSequentialExecutionTime}}{\text{ParallelALGExecutionTime}}$$
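The two classifications differ only in the numerator, which a small sketch can make concrete. The timings below are hypothetical values chosen for illustration, not measurements from the talk, and the function and variable names are assumptions.

```python
def speedup(sequential_time: float, parallel_time: float) -> float:
    """Generic speedup: sequential execution time over parallel execution time."""
    if sequential_time <= 0 or parallel_time <= 0:
        raise ValueError("execution times must be positive")
    return sequential_time / parallel_time


# Hypothetical timings in seconds (illustrative only, not from the slides).
best_sequential_alg = 80.0       # best known sequential algorithm
parallel_alg_on_one_pe = 100.0   # the parallel algorithm run sequentially
parallel_alg_on_many_pes = 20.0  # the parallel algorithm run in parallel

# Absolute speedup uses the best sequential algorithm as the baseline.
absolute = speedup(best_sequential_alg, parallel_alg_on_many_pes)

# Relative speedup uses the parallel algorithm's own sequential run.
relative = speedup(parallel_alg_on_one_pe, parallel_alg_on_many_pes)
```

With these made-up numbers the relative speedup is the larger of the two, because the parallel algorithm's own sequential run is slower than the best sequential algorithm.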
Relative Speedup Model
• Fixed-size Speedup
– Amdahl’s Law
$$\text{Speedup} = \frac{\text{sequentialTime}}{\text{parallelTime}} = \frac{1}{1 - \alpha + \dfrac{\alpha}{p}}$$
where $\alpha$ is the parallel fraction of the program's workload and $p$ is the number of processors.
• Fixed-time Speedup
– Gustafson’s Law
$$\text{Speedup} = \frac{\text{sequentialTime}}{\text{parallelTime}} = \frac{1 - \alpha + \alpha p}{1 - \alpha + \alpha} = 1 - \alpha + \alpha p$$
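As a minimal sketch, the two single-level laws can be written as plain functions of the parallel fraction and the processor count. The function names are assumptions made for this example.

```python
def amdahl_speedup(alpha: float, p: int) -> float:
    """Fixed-size speedup (Amdahl's Law): 1 / (1 - alpha + alpha / p)."""
    if not 0.0 <= alpha <= 1.0 or p < 1:
        raise ValueError("need 0 <= alpha <= 1 and p >= 1")
    return 1.0 / ((1.0 - alpha) + alpha / p)


def gustafson_speedup(alpha: float, p: int) -> float:
    """Fixed-time speedup (Gustafson's Law): 1 - alpha + alpha * p."""
    if not 0.0 <= alpha <= 1.0 or p < 1:
        raise ValueError("need 0 <= alpha <= 1 and p >= 1")
    return (1.0 - alpha) + alpha * p
```

Both functions return 1 for p = 1. As p grows, the Amdahl value stays below 1 / (1 - alpha), while the Gustafson value grows linearly in p.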
Motivation Example—NAS Benchmark (MPI+OpenMP)
Motivation Example—NAS Benchmark (MPI+OpenMP) Amdahl’s Law is UNSUITABLE for Multi-Level Parallel Computing
E-Amdahl’s Law
• Awareness of Parallelism at Different Granularity Levels
$$sp(i) = \begin{cases} \dfrac{1}{1 - f(m) + \dfrac{f(m)}{p(m)}} & (i = m) \\[2ex] \dfrac{1}{1 - f(i) + \dfrac{f(i)}{p(i)\, sp(i+1)}} & (1 \le i < m) \end{cases}$$
[Figure: tree of processing elements over levels L1, L2, L3, …, Lm, with PE 1,1 at level 1, PE 2,1 and PE 2,2 at level 2, and PE 3,1 to PE 3,8 at level 3. The legend distinguishes the sequential part from the parallel part.]
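The recursion for sp(i) can be evaluated from the finest level up to level 1. The sketch below assumes that f(i) is the parallel fraction at level i and p(i) is the number of processing elements spawned per parent at level i, a reading inferred from the two-level MPI+OpenMP case. The function name and the list-based interface are assumptions for illustration.

```python
from typing import Sequence


def e_amdahl_speedup(f: Sequence[float], p: Sequence[int]) -> float:
    """Evaluate sp(1) of the multi-level fixed-size recursion.

    f[0], p[0] describe level 1 (coarsest); f[-1], p[-1] describe level m.
    Assumed meaning: f[i] is the parallel fraction and p[i] the number of
    processing elements per parent at that level.
    """
    if len(f) != len(p) or len(f) == 0:
        raise ValueError("f and p must be non-empty and of equal length")
    # Starting from sp = 1 makes the first iteration equal to the base
    # case i = m: 1 / (1 - f(m) + f(m) / p(m)).
    sp = 1.0
    for fi, pi in zip(reversed(f), reversed(p)):
        sp = 1.0 / ((1.0 - fi) + fi / (pi * sp))
    return sp
```

With a single level, the function reduces to Amdahl's Law for that level.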
E-Amdahl’s Law
• Two-Level Parallelism Speedup Model (MPI+OpenMP)
$$sp(\alpha, \beta, p, t) = \frac{1}{1 - \alpha + \dfrac{\alpha \left(1 - \beta + \dfrac{\beta}{t}\right)}{p}}$$
where $\alpha$ is the parallel fraction of coarse-grained (MPI-level) parallelism, $\beta$ is the parallel fraction of fine-grained (OpenMP-level) parallelism, $p$ is the number of processes spawned, and $t$ is the number of threads spawned per process.
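A minimal sketch of the two-level formula follows, with a small loop that compares process/thread splits using the same total core count. The values of alpha and beta are hypothetical, chosen only to show how the model separates configurations that a single-level law would treat as identical. The function name is an assumption.

```python
def e_amdahl_two_level(alpha: float, beta: float, p: int, t: int) -> float:
    """Two-level fixed-size speedup: p processes, t threads per process."""
    if p < 1 or t < 1:
        raise ValueError("p and t must be at least 1")
    # Fraction of a process's parallel work that remains after threading.
    inner = (1.0 - beta) + beta / t
    return 1.0 / ((1.0 - alpha) + alpha * inner / p)


if __name__ == "__main__":
    alpha, beta = 0.98, 0.80  # hypothetical fractions, not from the talk
    for p, t in [(8, 1), (4, 2), (2, 4), (1, 8)]:  # always 8 cores in total
        print(f"p={p} t={t} speedup={e_amdahl_two_level(alpha, beta, p, t):.3f}")
```

Two limiting cases follow directly from the formula: with t = 1 it reduces to Amdahl's Law with p processors, and with beta = 1 it reduces to Amdahl's Law with p·t processors. For beta < 1 and a fixed product p·t, the predicted speedup falls as t grows.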
E-Gustafson’s Law
• Awareness of Parallelism at Different Granularity Levels
$$sp(i) = \begin{cases} 1 - f(m) + f(m)\, p(m) & (i = m) \\ 1 - f(i) + f(i)\, p(i)\, sp(i+1) & (1 \le i < m) \end{cases}$$
[Figure: tree of processing elements over levels L1, L2, L3, …, Lm, with PE 1,1 at level 1, PE 2,1 and PE 2,2 at level 2, and PE 3,1 to PE 3,8 at level 3. The legend distinguishes the sequential part from the parallel part.]
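The fixed-time recursion can be evaluated the same way, from level m up to level 1. As before, reading f(i) as the parallel fraction and p(i) as the number of processing elements per parent at level i is an assumption, and so is the function name.

```python
from typing import Sequence


def e_gustafson_speedup(f: Sequence[float], p: Sequence[int]) -> float:
    """Evaluate sp(1) of the multi-level fixed-time recursion.

    f[0], p[0] describe level 1 (coarsest); f[-1], p[-1] describe level m.
    """
    if len(f) != len(p) or len(f) == 0:
        raise ValueError("f and p must be non-empty and of equal length")
    # Starting from sp = 1 makes the first iteration equal to the base
    # case i = m: 1 - f(m) + f(m) * p(m).
    sp = 1.0
    for fi, pi in zip(reversed(f), reversed(p)):
        sp = (1.0 - fi) + fi * pi * sp
    return sp
```

For two levels this expands to 1 - f(1) + f(1)·p(1)·(1 - f(2) + f(2)·p(2)), and for one level it is Gustafson's Law.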
Experiment Setup
• Platform and Configuration
– A Linux cluster of eight computing nodes, each with two quad-core chips
– Configuration: one thread per CPU core
• Benchmarks: NAS Parallel Benchmark (NPB) Multi-Zone (MZ) version
– BT-MZ (unbalanced workload partitioning)
– SP-MZ (balanced workload partitioning)
– LU-MZ (balanced workload partitioning)
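The platform description implies 8 × 2 × 4 = 64 cores. The sketch below enumerates the process/thread splits that use every core with one thread per core. It assumes that the threads of a process stay on a single node, as is usual for OpenMP, and that every node hosts the same number of processes. Whether the experiments used exactly these splits is not stated in the slides, and the function name is hypothetical.

```python
NODES = 8
CHIPS_PER_NODE = 2
CORES_PER_CHIP = 4


def hybrid_configs(nodes: int, cores_per_node: int) -> list[tuple[int, int]]:
    """Return (processes, threads_per_process) pairs that fill every core.

    Assumption: threads of one process do not span nodes, so the thread
    count must divide the number of cores on a node.
    """
    total_cores = nodes * cores_per_node
    configs = []
    for t in range(1, cores_per_node + 1):
        if cores_per_node % t == 0:
            configs.append((total_cores // t, t))
    return configs


configs = hybrid_configs(NODES, CHIPS_PER_NODE * CORES_PER_CHIP)
```

Each resulting pair can be passed as (p, t) to a two-level speedup model to compare predicted speedups across configurations.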
Performance Prediction
Prediction Result Comparison
Conclusion
• Traditional speedup models are unsuitable for multi-level parallelism
– They are unaware of the different granularities of parallelism in multi-level parallel computing.
• Multi-level Parallelism Model
– A guidance model for multi-level optimization.
– A prediction model for multi-level parallelism.
Argument Estimation
Speedup Under E-Amdahl’s Law
Speedup Under E-Gustafson’s Law