

  1. Flat MPI vs. Hybrid: Evaluation of Parallel Programming Models for Preconditioned Iterative Solvers on "T2K Open Supercomputer"
     Kengo NAKAJIMA, Information Technology Center, The University of Tokyo
     Second International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), September 22, 2009, Vienna, held in conjunction with ICPP-09: The 38th International Conference on Parallel Processing

  2. Topics of this Study
     • Preconditioned Iterative Sparse Matrix Solvers for FEM Applications
     • T2K Open Supercomputer (Tokyo) (T2K/Tokyo)
     • Hybrid vs. Flat MPI Parallel Programming Models
     • Optimization of Hybrid Parallel Programming Models
       – NUMA Control
       – First Touch
       – Further Reordering of Data

  3. TOC
     • Background
       – Why Hybrid?
     • Target Application
       – Overview
       – HID
       – Reordering
     • Preliminary Results
     • Remarks

  4. T2K/Tokyo (1/2)
     • "T2K Open Supercomputer Alliance"
       – http://www.open-supercomputer.org/
       – Tsukuba, Tokyo, Kyoto
     • "T2K Open Supercomputer (Todai Combined Cluster)"
       – by Hitachi
       – operation started June 2008
       – 952 nodes (15,232 cores) in total, 141 TFLOPS peak
     • Quad-core Opteron (Barcelona)
       – 27th in TOP500 (Nov. 2008), fastest in Japan at that time

  5. T2K/Tokyo (2/2)
     • AMD Quad-core Opteron (Barcelona), 2.3 GHz
     • 4 "sockets" per node
       – 16 cores/node
     • Multi-core, multi-socket system
     • cc-NUMA architecture
       – careful configuration needed
       – local data in local memory
       – To reduce memory traffic in the system, it is important to keep the data close to the cores that will work with the data (e.g. NUMA control).
     [Figure: node diagram with four sockets, each holding four cores with private L1/L2 caches, a shared L3 cache, and a locally attached memory]

  6. Flat MPI vs. Hybrid
     • Flat MPI: each PE (core) runs an independent MPI process
     • Hybrid: hierarchical structure; the cores of a shared-memory unit are driven by threads within one MPI process, with message passing between the units
     [Figure: schematic comparing one MPI process per core (Flat MPI) with one MPI process per shared memory spanning several cores (Hybrid)]

  7. Flat MPI vs. Hybrid
     • Performance is determined by various parameters
     • Hardware
       – core architecture itself
       – peak performance
       – memory bandwidth, latency
       – network bandwidth, latency
       – their balance
     • Software
       – type of application: memory bound or network/communication bound
       – problem size
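
     For concreteness, a minimal hybrid MPI + OpenMP skeleton follows; it is an illustration written for this transcript, not code from the talk. One MPI process spans a shared-memory domain and spawns OpenMP threads inside it (e.g. 16 threads in one process per 16-core T2K node for HB 16x1, or 4 threads in each of 4 processes for HB 4x4), whereas flat MPI starts one single-threaded process per core.

       /* Minimal hybrid MPI + OpenMP skeleton (illustrative sketch).
        * Build e.g. with: mpicc -fopenmp hybrid.c */
       #include <mpi.h>
       #include <omp.h>
       #include <stdio.h>

       int main(int argc, char **argv)
       {
           int provided, rank, nprocs;

           /* threads call MPI only outside parallel regions here,
              so MPI_THREAD_FUNNELED is sufficient */
           MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
           MPI_Comm_rank(MPI_COMM_WORLD, &rank);
           MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

           #pragma omp parallel
           {
               int tid = omp_get_thread_num();
               printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
                      rank, nprocs, tid, omp_get_num_threads());
           }

           MPI_Finalize();
           return 0;
       }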

  8. Sparse Matrix Solvers by FEM, FDM, etc.
     • Memory-Bound
       – indirect accesses, as in the matrix-vector product loop below
       – Hybrid (OpenMP) is more memory-bound
     • Latency-Bound for Parallel Computations
       – communication occurs only at domain boundaries
       – small amount of messages
     • Exa-scale Systems
       – O(10^8) cores
       – communication overhead due to MPI latency for >10^8-way MPI
       – expectations for Hybrid: only 1/16 as many MPI processes on T2K/Tokyo
     Matrix-vector product with indirect access (loop from the slide, cleaned up; here Index[i] .. Index[i+1]-1 are the nonzero entries of row i):
       for (i=0; i<N; i++) {
         for (k=Index[i]; k<Index[i+1]; k++) {
           Y[i] = Y[i] + A[k] * X[Item[k]];
         }
       }
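
     Within one MPI process of the hybrid model, this kernel is threaded with OpenMP, for example as below. This is an illustrative sketch, not the author's code; the function name and arguments are assumptions, and the arrays follow the loop above.

       /* y = y + A*x for a CRS-stored sparse matrix, threaded with OpenMP
        * (illustrative sketch).  Rows are independent, so a parallel-for
        * is safe; the indirect access x[item[k]] is what makes the kernel
        * memory-bound. */
       void spmv_crs_omp(int n, const int *index, const int *item,
                         const double *a, const double *x, double *y)
       {
           #pragma omp parallel for
           for (int i = 0; i < n; i++) {
               double s = 0.0;
               for (int k = index[i]; k < index[i + 1]; k++)
                   s += a[k] * x[item[k]];
               y[i] += s;
           }
       }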

  9. Weak Scaling Results on the Earth Simulator (ES): GeoFEM Benchmarks [KN 2003]
     • Generally speaking, Hybrid is better for a large number of nodes
     • especially for a small problem size per node
       – "less" memory bound
     [Figure: performance (TFLOPS, 0-4) vs. PE# (0-1280) for Flat MPI and Hybrid with large and small problem sizes per node]

  10. • Background
        – Why Hybrid?
      • Target Application
        – Overview
        – HID
        – Reordering
      • Preliminary Results
      • Remarks

  11. Target Application
      • 3D Elastic Problems with Heterogeneous Material Property
        – Emax = 10^3, Emin = 10^-3, ν = 0.25
        – generated by the "sequential Gauss" algorithm for geo-statistics [Deutsch & Journel, 1998]
        – 128^3 tri-linear hexahedral elements, 6,291,456 DOF
      • Strong Scaling
      • (SGS+CG) Iterative Solvers
        – Conjugate Gradient with Symmetric Gauss-Seidel preconditioning
        – HID-based domain decomposition
      • T2K/Tokyo
        – 512 cores (32 nodes)
        – FORTRAN90 (Hitachi) + MPI
        – Flat MPI, Hybrid (4x4, 8x2, 16x1)
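
      As context for the following slides, here is a sketch of the preconditioned Conjugate Gradient iteration that the (SGS+CG) solver is built on. It is the standard textbook algorithm written for this transcript, not the author's FORTRAN90 code; a plain diagonal (Jacobi) preconditioner stands in for the Symmetric Gauss-Seidel preconditioner, whose forward/backward sweeps are exactly the part that needs the reordering discussed on the following slides. All names are illustrative.

        #include <math.h>
        #include <stdlib.h>
        #include <string.h>

        static double dot(int n, const double *u, const double *v)
        {
            double s = 0.0;
            for (int i = 0; i < n; i++) s += u[i] * v[i];
            return s;
        }

        /* same CRS matrix-vector product as the loop on slide 8 */
        static void spmv(int n, const int *index, const int *item,
                         const double *a, const double *x, double *y)
        {
            for (int i = 0; i < n; i++) {
                double s = 0.0;
                for (int k = index[i]; k < index[i + 1]; k++)
                    s += a[k] * x[item[k]];
                y[i] = s;
            }
        }

        /* Preconditioned CG sketch; diag holds the matrix diagonal.
         * Returns the number of iterations taken. */
        int pcg(int n, const int *index, const int *item, const double *a,
                const double *diag, const double *b, double *x,
                double tol, int maxit)
        {
            double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
            double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);

            spmv(n, index, item, a, x, q);                      /* q = A x     */
            for (int i = 0; i < n; i++) r[i] = b[i] - q[i];     /* r = b - A x */
            for (int i = 0; i < n; i++) z[i] = r[i] / diag[i];  /* z = M^-1 r  */
            memcpy(p, z, n * sizeof *p);
            double rho = dot(n, r, z), bnrm = sqrt(dot(n, b, b));

            int it = 0;
            while (it < maxit) {
                it++;
                spmv(n, index, item, a, p, q);                  /* q = A p     */
                double alpha = rho / dot(n, p, q);
                for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
                if (sqrt(dot(n, r, r)) <= tol * bnrm) break;    /* converged   */
                for (int i = 0; i < n; i++) z[i] = r[i] / diag[i];
                double rho1 = dot(n, r, z), beta = rho1 / rho;
                for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
                rho = rho1;
            }
            free(r); free(z); free(p); free(q);
            return it;
        }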

  12. HID: Hierarchical Interface Decomposition [Henon & Saad 2007]
      • Multilevel Domain Decomposition
        – extension of Nested Dissection
      • Non-overlapping approach: connectors, separators
      • Suitable for parallel preconditioning methods
      [Figure: 2D mesh partitioned among four domains (0-3); interior nodes form the level-1 connectors, nodes shared by two domains (0,1 / 0,2 / 1,3 / 2,3) form the level-2 connectors, and nodes shared by all four domains (0,1,2,3) form the level-4 connector]

  13. Parallel Preconditioned Iterative Solvers on an SMP/Multicore Node by OpenMP
      • DAXPY, SMVP, Dot Products
        – easy to parallelize
      • Factorization, Forward/Backward Substitutions in Preconditioning Processes
        – global data dependency
        – reordering required for parallelism: forming independent sets
        – Multicolor Ordering (MC), Reverse Cuthill-McKee (RCM)
        – worked on the "Earth Simulator" [KN 2002, 2003], both for parallel and vector performance
      • CM-RCM (Cyclic Multicoloring + RCM)
        – robust and efficient
        – elements in each color are independent (see the sketch below)
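
      A sketch of the pattern this slide describes: with a multicolor ordering, a forward (Gauss-Seidel-type) substitution proceeds color by color, and the rows within one color can be updated in parallel by OpenMP threads. This is an illustration of the technique only, not the author's code; the array names (colorIndex, indexL, itemL, al, dinv) are assumptions.

        /* Forward substitution with a multicolor ordering (illustrative
         * sketch).  Rows are renumbered so that color c owns rows
         * colorIndex[c] .. colorIndex[c+1]-1, and rows of the same color
         * have no matrix connections with each other, so the loop over
         * the rows of one color is safe to run in parallel.
         * indexL/itemL/al hold the strictly lower triangle in CRS form,
         * dinv the inverted diagonal; x holds the right-hand side on
         * entry and the result on exit. */
        void forward_sweep_mc(int ncolors, const int *colorIndex,
                              const int *indexL, const int *itemL,
                              const double *al, const double *dinv, double *x)
        {
            for (int c = 0; c < ncolors; c++) {       /* colors: in order        */
                #pragma omp parallel for              /* one color: in parallel  */
                for (int i = colorIndex[c]; i < colorIndex[c + 1]; i++) {
                    double s = x[i];
                    for (int k = indexL[i]; k < indexL[i + 1]; k++)
                        s -= al[k] * x[itemL[k]];     /* rows of earlier colors  */
                    x[i] = s * dinv[i];
                }
            }
        }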

  14. Ordering Methods
      [Figure: an 8x8 grid numbered by three orderings: MC (Multicoloring, color #=4), RCM (Reverse Cuthill-McKee), and CM-RCM (Cyclic MC + RCM, color #=4)]

  15. Effect of Ordering Methods on Convergence
      [Figure: number of iterations (50-90) vs. number of colors (1-1000, log scale) for MC and CM-RCM orderings]

  16. Re-Ordering by CM-RCM (5 colors, 8 threads)
      [Figure: the initial vector is colored (5 colors) and reordered so that elements of the same color are numbered contiguously; each color block is then split among threads 1-8]
      • Elements in each color are independent, therefore parallel processing is possible: each color block is divided among the OpenMP threads (8 threads in this case).
      • Because all arrays are numbered according to "color", discontinuous memory access may happen on each thread.

  17. • Background
        – Why Hybrid?
      • Target Application
        – Overview
        – HID
        – Reordering
      • Preliminary Results
      • Remarks

  18. Flat MPI, Hybrid (4x4, 8x2, 16x1)
      [Figure: placement of MPI processes and OpenMP threads on the four sockets (0-3) of a node for Flat MPI and for Hybrid 16x1, 4x4, and 8x2]

  19. CASES for Evaluation
      • Focused on optimization of HB 8x2, HB 16x1
      • CASE-1
        – initial case (CM-RCM)
        – for evaluation of the effect of NUMA control
          • specifies the local core-memory configuration
      • CASE-2 (Hybrid only)
        – First Touch
      • CASE-3 (Hybrid only)
        – Further Data Reordering + First Touch
      • NUMA policy (0-5) applied to each case

  20. Results of CASE-1, 32 nodes / 512 cores
      • Computation time for linear solvers, normalized by Flat MPI (Policy 0)
      [Figure: relative performance of the best NUMA policy (policy 2) for Flat MPI, HB 4x4, HB 8x2 and HB 16x1]

      NUMA policies (command line switches):
        Policy 0: no command line switches
        Policy 1: --cpunodebind=$SOCKET --interleave=all
        Policy 2: --cpunodebind=$SOCKET --interleave=$SOCKET
        Policy 3: --cpunodebind=$SOCKET --membind=$SOCKET
        Policy 4: --cpunodebind=$SOCKET --localalloc
        Policy 5: --localalloc
      e.g. mpirun -np 64 -cpunodebind 0,1,2,3 a.out

      Iterations and best policy (CASE-1):
        Method     Iterations   Best Policy
        Flat MPI   1264         2
        HB 4x4     1261         2
        HB 8x2     1216         2
        HB 16x1    1244         2

  21. First Touch Data Placement
      (ref.: "Patterns for Parallel Programming", Mattson, T.G. et al.)
      To reduce memory traffic in the system, it is important to keep the data close to the PEs that will work with the data (e.g. NUMA control). On NUMA computers, this corresponds to making sure that the pages of memory are allocated and "owned" by the PEs that will be working with the data contained in each page. The most common NUMA page-placement algorithm is the "first touch" algorithm, in which the PE first referencing a region of memory will have the page holding that memory assigned to it. A very common technique in OpenMP programs is therefore to initialize data in parallel, using the same loop schedule as will be used later in the computations.
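
      A minimal sketch of the first-touch technique described above (written for this transcript, not taken from the slides). On systems whose default page-placement policy is first touch, initializing the arrays in parallel with the same static schedule as the compute loop places each thread's pages in its local memory; the array sizes and loop bodies below are illustrative.

        /* First-touch placement sketch.  malloc only reserves virtual
         * pages; with a first-touch policy, each physical page is placed
         * on the NUMA node of the thread that first writes it.
         * Build e.g. with: cc -fopenmp first_touch.c */
        #include <stdlib.h>

        int main(void)
        {
            const int n = 1 << 24;
            double *x = malloc(n * sizeof *x);
            double *y = malloc(n * sizeof *y);

            /* initialization uses the same static schedule as the compute
               loop, so the thread that places a page later computes on it */
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 0.0; }

            /* compute loop: each thread touches the pages it placed above */
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < n; i++) y[i] += 2.0 * x[i];

            free(x);
            free(y);
            return 0;
        }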
