A Massively Parallel Dense Symmetric Eigensolver with Communication Splitting Multicasting Algorithm

Takahiro Katagiri (Information Technology Center, The University of Tokyo)
Shoji Itoh (Advanced Center for Computing and Communication, RIKEN; currently with the Information Technology Center, The University of Tokyo)

VECPAR'10, 9th International Meeting on High Performance Computing for Computational Science
CITRIS, UC Berkeley, CA, USA, June 23 (Wednesday)
Session VI: Solvers on Emerging Architectures (Room 250), 18:00 – 18:25 (25 min.)
Outline
Background
Communication Splitting Multicasting Algorithm for a Symmetric Dense Eigensolver
Performance Evaluation
◦ T2K Open Supercomputer (U. Tokyo): AMD Opteron Quad Core (Barcelona)
◦ RICC PRIMERGY RX200S (RIKEN): Intel Xeon X5570 Quad Core (Nehalem)
Conclusion
BACKGROUND
Issues in Establishing 100,000-Way Parallelism: the Need for a New "Design Space"

1. Load Imbalance
A big blocking length for data distribution and computation damages load balance in massively parallel processing (MPP). In ScaLAPACK, one "big" block size is used both for the BLAS operations and for the data distribution.
Ex: block size 160 and the minimum executable matrix size (see the worked check after this slide):
◦ With 10,000 cores, the size is 16,000.
◦ With 100,000 cores, the size is 50,596.
The whole matrix size is NOT small! Execution with these minimal sizes causes very heavy load imbalance.

2. Communication Pattern and Performance
In a 1D data distribution, all cores are occupied by one collective operation: MPI_ALLREDUCE over 10,000 cores * 1 group.
In a 2D data distribution: MPI_ALLREDUCE over 100 cores * (100 groups simultaneously).

3. Communication Hiding
Previously: computation overlapped with non-blocking communication.
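As a worked check of the load-imbalance point above: under a 2D block distribution, every process of a square grid must own at least one block of size b, which fixes the minimum executable matrix size. The arithmetic below is a sketch under the assumption of a √p × √p process grid.

% Minimum executable matrix size under a 2D block distribution
% with block size b on a sqrt(p) x sqrt(p) process grid.
\[
  n_{\min} = b\,\sqrt{p},\qquad b = 160:\quad
  p = 10{,}000 \;\Rightarrow\; n_{\min} = 160 \times 100 = 16{,}000,\qquad
  p = 100{,}000 \;\Rightarrow\; n_{\min} \approx 160 \times 316.23 \approx 50{,}596.
\]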
The Aim of This Study
To establish an eigensolver algorithm for small-sized matrices on MPP.
◦ Conventional design space:
  Small-scale parallelism: up to 1,000 cores.
  "Ultra" large-scale execution on MPP: matrix dimensions of 100,000 to 1,000,000. This is too big to run the solver in an actual supercomputer service.
What is a "small size" for the target?
◦ The work area size per core matches the L1–L2 caches.
What is MPP for the target?
◦ From 10,000 cores to 100,000 cores.
◦ Flat MPI model. Hybrid MPI is also covered if we can establish the principal MPP algorithm.
Our Design Space for the Solver

1. Improvement of Load Imbalance
Use a "non-blocking" algorithm: the data distribution size can be permanently ONE, so no load imbalance caused by the data distribution occurs.
Do not use symmetry: a simple computation kernel and high parallelism, at the cost of increased computational complexity.

2. Data Distribution for MPP
Use a 2D cyclic distribution: a (cyclic, cyclic) distribution with block size one gives perfect load balancing (see the sketch after this slide).
Multicasting communication: reduces the communication time of MPI_BCAST and MPI_ALLREDUCE even when the number of cores or the vector size increases.
Use duplication of pivot vectors: reduces the gathering communication time.

3. Future work: Communication Hiding
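As a small illustration of the (cyclic, cyclic) distribution with block size one, the sketch below maps a global entry (i, j) to its owning process and local storage index. The process-grid shape (pr × pc) and the function name owner_of are assumptions for illustration, not the solver's actual code.

/* Sketch: (cyclic, cyclic) distribution with block size one.
 * Global entry (i, j) lives on process (i mod pr, j mod pc) of a
 * pr x pc grid, at local position (i / pr, j / pc). */
#include <stdio.h>

typedef struct { int prow, pcol; int li, lj; } Owner;

static Owner owner_of(int i, int j, int pr, int pc)
{
    Owner o = { i % pr, j % pc, i / pr, j / pc };
    return o;
}

int main(void)
{
    /* Example: an 8x8 matrix on a 2x2 grid -- every process owns one
     * quarter of the entries, interleaved element by element. */
    for (int i = 0; i < 8; ++i) {
        for (int j = 0; j < 8; ++j) {
            Owner o = owner_of(i, j, 2, 2);
            printf("(%d,%d) ", o.prow, o.pcol);
        }
        printf("\n");
    }
    return 0;
}

Because every process owns every pr-th row and every pc-th column, the active trailing submatrix of the tridiagonalization stays evenly spread over all processes at every step, which is the "perfect load balancing" claimed above.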
AN EIGENSOLVER ALGORITHM FOR MASSIVELY PARALLEL COMPUTING
The Eigenvalue Problem
Standard eigenproblem: Ax = λx
  x: an eigenvector, λ: an eigenvalue
Application fields:
• Several science and technology problems.
• Quantum chemistry: dense and symmetric; requires most of the eigenvalues and eigenvectors.
• Searching on the Internet (knowledge discovery) [M. Berry et al., 1995].
Dense: the computational complexity is O(n^3), so parallelization is needed.
A Classical Sequential Algorithm for Ax = λx (Standard Eigenproblem)

1. Householder transformation (tridiagonalization): Q^T A Q = T, where A is a symmetric dense matrix, T is a tridiagonal matrix, and Q = H_1 H_2 … H_{n-2}. Cost: O(n^3).
2. Bisection on T: all eigenvalues Λ.
3. Inverse iteration on T: all eigenvectors Y.
   Steps 2–3 cost O(n^2) ~ O(n^3); with MRRR, O(n^2).
4. Householder inverse transformation back to the dense matrix A: all eigenvectors X = QY. Cost: O(n^3).
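A one-line check of step 4, i.e., why back-transforming the tridiagonal eigenvectors yields the eigenvectors of the dense matrix:

% Since Q^T A Q = T and Q is orthogonal, an eigenpair of T maps to an
% eigenpair of A with the same eigenvalue under x_i = Q y_i.
\[
  Q^{T} A Q = T,\qquad T y_i = \lambda_i y_i
  \;\Longrightarrow\;
  A\,(Q y_i) = Q\,T y_i = \lambda_i\,(Q y_i),
  \qquad X = QY .
\]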
Basic Operations of Householder Tridiagonalization (Non-blocking Version)

Let A^{(k)} be the matrix at the k-th iteration. The operations in the k-th iteration are as follows (the reflector u_k is scaled so that u_k^T u_k = 2):

  u_k ← A^{(k)}_{k:n, k}              : Householder reflection
  H_k = I − u_k u_k^T                 : Householder operator
  A^{(k+1)} = H_k A^{(k)} H_k         : Householder transformation

do k = 1, n−2
  ① y_k^T = u_k^T A^{(k)}_{k:n, k:n}                               : matrix–vector multiplication
  ② β_k = y_k^T u_k                                                 : dot product
  ③ x_k = y_k                                                       : copy (when symmetric)
  ④ H_k A^{(k)} H_k = A^{(k)} − (x_k − β_k u_k) u_k^T − u_k y_k^T   : matrix update
end do
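The following is a minimal serial C sketch of this unblocked kernel, shown only to make steps ①–④ concrete. The column-major layout, the u_k^T u_k = 2 scaling, and the routine name householder_tridiag are assumptions for illustration; the actual solver runs these steps on a 2D-distributed matrix with the communication scheme described next.

/* Serial sketch of the unblocked (non-blocking) Householder
 * tridiagonalization above. The reflector is scaled so u^T u = 2,
 * hence H = I - u u^T, and the full trailing submatrix is updated. */
#include <math.h>
#include <stdlib.h>

#define A(i, j) a[(size_t)(j) * n + (i)]   /* column-major access */

void householder_tridiag(double *a, int n)
{
    double *u = malloc((size_t)n * sizeof *u);
    double *y = malloc((size_t)n * sizeof *y);

    for (int k = 0; k < n - 2; ++k) {
        int m = n - (k + 1);               /* active column length below the diagonal */

        /* Build the Householder vector from A(k+1:n-1, k). */
        double norm = 0.0;
        for (int i = 0; i < m; ++i) { u[i] = A(k + 1 + i, k); norm += u[i] * u[i]; }
        norm = sqrt(norm);
        if (norm == 0.0) continue;         /* column already annihilated */
        double alpha = (u[0] >= 0.0) ? -norm : norm;   /* sign avoids cancellation */
        u[0] -= alpha;
        double unorm2 = 0.0;
        for (int i = 0; i < m; ++i) unorm2 += u[i] * u[i];
        double scale = sqrt(2.0 / unorm2);             /* now u^T u = 2 */
        for (int i = 0; i < m; ++i) u[i] *= scale;

        /* (1) y^T = u^T A on the trailing submatrix, (2) beta = y^T u. */
        double beta = 0.0;
        for (int j = 0; j < m; ++j) {
            double t = 0.0;
            for (int i = 0; i < m; ++i) t += u[i] * A(k + 1 + i, k + 1 + j);
            y[j] = t;
            beta += t * u[j];
        }

        /* (3) x = y in the symmetric case; (4) A <- A - (x - beta*u) u^T - u y^T. */
        for (int j = 0; j < m; ++j)
            for (int i = 0; i < m; ++i)
                A(k + 1 + i, k + 1 + j) -= (y[i] - beta * u[i]) * u[j] + u[i] * y[j];

        /* Column k (and row k) reduce to a single sub/super-diagonal entry. */
        A(k + 1, k) = alpha;
        A(k, k + 1) = alpha;
        for (int i = k + 2; i < n; ++i) { A(i, k) = 0.0; A(k, i) = 0.0; }
    }
    free(u);
    free(y);
}

Note that step ④ updates all m × m entries of the trailing submatrix rather than only one triangle, reflecting the design decision above not to exploit symmetry in storage and computation.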
Communication Time Reduction for Householder Tridiagonalization

[Diagram: 2D (cyclic, cyclic) distribution over PE1–PE4 with multi-broadcasts.]

Drawback: the number of communications increases.
Merit: perfect load balancing, and the communication volume is reduced from O(n^2 log_2 p) to O((n^2/√p) log_2 √p).
  (p: #processes, n: problem size)
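A hedged sketch of how such split collectives can be set up with standard MPI: the p processes are viewed as a √p × √p grid, and each MPI_BCAST / MPI_ALLREDUCE runs inside a row or column communicator of about √p ranks, so up to √p groups communicate simultaneously. The communicator names, buffer sizes, and root choice below are illustrative assumptions, not the paper's implementation.

/* Sketch: split MPI_COMM_WORLD into row and column communicators so that
 * collectives run over ~sqrt(p) ranks each instead of all p ranks. */
#include <math.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int q = (int)(sqrt((double)nprocs) + 0.5);   /* assume p = q * q */
    int my_row = rank / q, my_col = rank % q;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, &row_comm); /* q ranks each */
    MPI_Comm_split(MPI_COMM_WORLD, my_col, my_row, &col_comm); /* q ranks each */

    /* The q row groups broadcast their pivot-vector pieces simultaneously,
     * instead of one global broadcast over all p ranks. */
    double piece[1024] = {0.0};
    MPI_Bcast(piece, 1024, MPI_DOUBLE, /*root=*/0, row_comm);

    /* Partial sums of y_k^T = u_k^T A are reduced within each column group. */
    double local[1024] = {0.0}, global[1024];
    MPI_Allreduce(local, global, 1024, MPI_DOUBLE, MPI_SUM, col_comm);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}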
An Example: HITACHI SR2201 (Yr. 2000), Hessenberg Reduction, n = 4096
[Figure: execution time in seconds (log scale, 10–1000) versus #processes (4–256) for the 1D distributions (*, Block) and (*, Cyclic), and the 2D distributions (Block, Block) and (Cyclic, Cyclic).]
Effect of the 2D Distribution in Our Method (HITACHI SR2201 (Yr. 2000), Householder Tridiagonalization)
[Figure: execution time in seconds (log scale) versus problem size for our method and ScaLAPACK on 4, 64, 128, and 512 processes; annotated speedups over ScaLAPACK are 3.6x, 5.4x, 4.6x, and 5.7x, with annotated times of 325 seconds and 81 seconds on the larger panels.]
Whole Parallel Process of the Eigensolver

Tridiagonalization: A → T, then gather all elements of T.
Compute upper and lower limits for the eigenvalues (bisection): all eigenvalues Λ, λ_1, λ_2, λ_3, λ_4, … in rising order.
Compute the eigenvectors (inverse iteration), then gather Y: all eigenvectors y_1, y_2, y_3, y_4, … corresponding to the rising order of the eigenvalues.
Data Duplication for the Tridiagonalization
[Diagram: the distributed matrix A, with duplication of the pivot vectors u_k and x_k across the p processes of one grid dimension and duplication of the vectors y_k across the q processes of the other dimension.]