

  1. Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators. Toshio Endo, Akira Nukada, Satoshi Matsuoka, Naoya Maruyama. Tokyo Institute of Technology, Japan. IPDPS 2010, Atlanta.

  2. GPU/Accelerators for High Performance Computing
     • In HPC systems, power consumption has been and will remain a major concern
     • GPUs and accelerators are promising for their excellent Flops/Watt ratio (see the quick calculation below)
       • ClearSpeed X620: 80 GFlops (DP), 6.4 GB/s memory bandwidth, 25 W
       • NVidia GeForce GTX285: 1063 GFlops (SP), 88 GFlops (DP), 159 GB/s memory bandwidth, 183 W
       • ATI Radeon HD 4870: 1200 GFlops (SP), 240 GFlops (DP), 115 GB/s memory bandwidth, 160 W
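A quick back-of-the-envelope check of the Flops/Watt claim, using only the numbers from the table above (the comparison itself is implied by the slide, not stated on it):

```c
/* Double-precision Flops per Watt for each accelerator in the table. */
#include <stdio.h>

int main(void)
{
    const char *dev[]  = { "ClearSpeed X620", "GeForce GTX285", "Radeon HD 4870" };
    double dp_gflops[] = { 80.0, 88.0, 240.0 };
    double watts[]     = { 25.0, 183.0, 160.0 };

    for (int i = 0; i < 3; i++)
        printf("%-16s %.2f GFlops/W (DP)\n", dev[i], dp_gflops[i] / watts[i]);
    return 0;
}
```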

  3. Heterogeneous Systems
     • Heterogeneous architectures that combine general-purpose CPUs and accelerators will be attractive for:
       • Generality, from general-purpose CPUs (typically x86/x86-64)
       • A higher Flops/Watt ratio, from accelerators (GPUs, Cell processors, ClearSpeed, ...)
     • Examples:
       • LANL RoadRunner: 1.4 PF with 12,240 PowerXCell 8i
       • NUDT Tianhe-1: 1.2 PF with 5,120 Radeon HD 4870
       • TokyoTech TSUBAME: 160 TF with 680 Tesla S1070 GPUs + 648 ClearSpeed accelerators

  4. Our Contribution
     • Demonstrated the scalability of a heterogeneous system, TSUBAME
     • A Linpack implementation that cooperatively uses:
       • 10,368 Opteron cores
       • 612 Tesla GPUs
       • 648 ClearSpeed accelerators
       • 640 Xeon cores
     • A different strategy than on RoadRunner or Tianhe-1 is required
     • Result: 87.01 TFlops, #56 in the Top500 ranking

  5. LANL RoadRunner (2008)
     • The largest heterogeneous system, and the first PetaFlops machine in the world
     • 6,120 dual-core Opterons and 12,240 PowerXCell 8i, in IBM blades
     • Peak performance is 1.4 PFlops; more than 90% of it comes from the Cells
     • Linpack performance is 1.042 PFlops, #2 in the Top500 ranking

  6. Tokyo-Tech TSUBAME Supercomputer
     • Tokyo-Tech Supercomputer and UBiquitously Accessible Mass-storage Environment
     • 燕 ("TSUBAME") also means "swallow", the symbol mark of Tokyo-Tech

  7. TSUBAME Basic Data
     • 655-node Linux cluster
       • Sun Fire X4600: 8 dual-core Opteron 880 CPUs (= 16 cores) and 32 GB DDR memory per node
       • Plus Tesla S1070 GPUs and ClearSpeed accelerators
     • ~1.1 MW power consumption, 350 m² footprint
     • SUSE Linux Enterprise 10
     • Jobs are managed by a batch scheduler (a customized version of Sun N1 Grid Engine)
     • A production system used by more than 1,500 users

  8. Accelerators Installed (1): NVIDIA Tesla S1070
     • 4 GPUs in a 1U box; 800 watts per box
     • Each GPU has:
       • 30 multiprocessors x 8 stream processors
       • 86 GFlops (double precision)
       • 4 GB GDDR3 memory
       • 102 GB/s memory bandwidth
     • Connected to hosts via external PCI-Express cables; 2 GPUs hang on one cable
     • Programmed with the CUDA programming language
     • 320 out of the 655 TSUBAME nodes are connected to 2 GPUs each ('inter-node' heterogeneity)

  9. Accelerators Installed (2): ClearSpeed X620 Accelerator
     • PCI-X board with 2 CSX600 processors x 96 SIMD cores
       • 80 GFlops (double precision)
       • 1 GB DDR memory
       • 6.4 GB/s memory bandwidth
       • 25 watts per board
     • Programmed with the ClearSpeed Cn programming language
     • Each TSUBAME node has one board

  10. TSUBAME Node with Hybrid Accelerators
     • Sun Fire X4600 node: 8 dual-core Opteron CPUs (16 cores) and 32 GB memory
     • A ClearSpeed board attached via PCI-X (1 GB/s)
     • 2 Tesla GPUs attached via PCI-e gen1 x8 (2 GB/s)
     • SDR InfiniBand (1 GB/s x 2) to other nodes

  11. History of TSUBAME in Top500
       Jun06: 38.18 TF, rank 7
       Nov06: 47.38 TF, rank 9
       Jun07: 48.88 TF, rank 14
       Nov07: 56.43 TF, rank 16
       Jun08: 67.70 TF, rank 24
       Nov08: 77.48 TF, rank 29
       Jun09: 77.48 TF, rank 41
       Nov09: 87.01 TF, rank 56
     • The processor configuration evolved from Opteron only, to Opteron + ClearSpeed (x 360, later x 648), to Opteron + Xeon + ClearSpeed + Tesla
     • The 3rd system among heterogeneous systems (from Nov06 to Nov07, it was the 1st)
     • Continuous improvement across seven consecutive lists

  12. What is Linpack?
     • A numerical benchmark used in the Top500 supercomputer ranking (www.top500.org)
       • Solves a dense linear equation Ax = b of order N
       • A direct solver; the total computation cost is O(N³) (a rough runtime estimate follows this slide)
       • Users can configure N; on TSUBAME, N ~ 1,000,000
     • HPL (High-Performance Linpack) by A. Petitet
       • A famous parallel MPI implementation, designed for uniform systems
       • Based on blocked LU decomposition with partial pivoting
       • The most time-consuming part is matrix multiplication (DGEMM)
       • Used as the basis of our implementation
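The O(N³) cost can be made concrete with a rough estimate. This is illustrative only: the (2/3)·N³ operation count is the standard Linpack count (not stated on this slide), and the 87.01 TFlops figure is the Top500 result quoted on slide 4.

```c
/* Rough Linpack run-time estimate: ~(2/3)*N^3 operations at the achieved rate. */
#include <stdio.h>

int main(void)
{
    double N     = 1.0e6;                    /* matrix order on TSUBAME */
    double rmax  = 87.01e12;                 /* achieved Flops from the Top500 run */
    double flops = (2.0 / 3.0) * N * N * N;  /* ~6.7e17 floating-point operations */

    printf("estimated run time: %.1f hours\n", flops / rmax / 3600.0);
    return 0;
}
```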

  13. HPL Algorithm
     • LU decomposition of an N × N matrix A, with block size B:
       for (k = 0; k < N; k += B)
         • Panel factorization with partial pivoting, to obtain L
         • Broadcast L
         • Row exchange, then compute U
         • Update the rest of the matrix: A' = A' - L × U
     • The DGEMM in the update step is the most time-consuming part (a simplified serial sketch follows this slide)
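For readers unfamiliar with the algorithm, here is a minimal serial sketch of the blocked, right-looking LU decomposition the slide describes. It deliberately omits partial pivoting, the panel broadcast and the row exchanges, and uses naive loops instead of BLAS calls; it only shows where the panel factorization, the U block row, and the dominant A' = A' - L × U (DGEMM) update occur. This is not the HPL source, just an illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Blocked right-looking LU of an n x n column-major matrix A (leading
 * dimension n) with block size B.  No pivoting, no MPI, no BLAS. */
static void blocked_lu(double *A, int n, int B)
{
    for (int k = 0; k < n; k += B) {
        int b = (k + B <= n) ? B : n - k;

        /* 1) panel factorization: unblocked LU of the b-column panel */
        for (int j = k; j < k + b; j++)
            for (int i = j + 1; i < n; i++) {
                A[i + j * n] /= A[j + j * n];              /* multiplier L(i,j) */
                for (int c = j + 1; c < k + b; c++)
                    A[i + c * n] -= A[i + j * n] * A[j + c * n];
            }

        /* 2) U block row: solve L11 * U12 = A12 (unit lower triangular) */
        for (int c = k + b; c < n; c++)
            for (int j = k; j < k + b; j++)
                for (int i = j + 1; i < k + b; i++)
                    A[i + c * n] -= A[i + j * n] * A[j + c * n];

        /* 3) trailing update A' = A' - L21 * U12 (this is the DGEMM) */
        for (int c = k + b; c < n; c++)
            for (int j = k; j < k + b; j++)
                for (int i = k + b; i < n; i++)
                    A[i + c * n] -= A[i + j * n] * A[j + c * n];
    }
}

int main(void)
{
    int n = 8, B = 4;
    double *A = malloc((size_t)n * n * sizeof *A);
    /* random, diagonally dominant matrix so skipping pivoting is safe */
    for (int i = 0; i < n * n; i++)
        A[i] = (double)rand() / RAND_MAX + (i % n == i / n ? n : 0);
    blocked_lu(A, n, B);
    printf("A[0] after factorization: %g\n", A[0]);
    free(A);
    return 0;
}
```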

  14. Data Decomposition in HPL
     • Matrix A is uniformly distributed among processes with a 2D block-cyclic distribution
     • Figure: matrix distribution on 6 (= 2 x 3) processes; each process holds a "partial matrix" consisting of the blocks mapped to it, and the update A' = A' - L × U is applied to those local blocks (a small mapping sketch follows this slide)
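A small sketch of the 2D block-cyclic rule: global block (I, J) is owned by process (I mod P, J mod Q) on a P x Q process grid and stored there as local block (I / P, J / Q). The helper names here are made up for illustration, not taken from the HPL source.

```c
#include <stdio.h>

/* Owner and local coordinates of a global block under 2D block-cyclic mapping. */
typedef struct { int prow, pcol, lrow, lcol; } owner_t;

static owner_t block_owner(int I, int J, int P, int Q)
{
    owner_t o = { I % P, J % Q, I / P, J / Q };
    return o;
}

int main(void)
{
    int P = 2, Q = 3;   /* the 6 (= 2 x 3) process grid from the slide */
    for (int I = 0; I < 4; I++)
        for (int J = 0; J < 4; J++) {
            owner_t o = block_owner(I, J, P, Q);
            printf("global block (%d,%d) -> process (%d,%d), local block (%d,%d)\n",
                   I, J, o.prow, o.pcol, o.lrow, o.lcol);
        }
    return 0;
}
```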

  15. Design Issues on Heterogeneous Systems
     • Who computes?
       • Kernel functions (DGEMM, DTRSM): only accelerators, or both CPUs and accelerators?
       • Non-kernel parts
     • Where are matrix data placed?
       • Host memory? Accelerator memory?
     • Strategies depend on the system architecture
       • We compare our decisions with those on Roadrunner [PPoPP09]
       • More challenging on TSUBAME

  16. Who Computes?
     • Non-kernel parts: only CPUs are used (MPI communication, pivoting, ...)
     • Kernel functions:
       • On Roadrunner, the Cells contribute 96% of peak performance and the CPUs only 4%, so only the Cells are used
       • On TSUBAME, the CPUs contribute 35%; omitting any processor type heavily degrades performance, so all of the CPUs, GPUs and ClearSpeeds are used
     • Breakdown of peak performance (DP) per processor type:
       • Roadrunner (total 1457 TF): Opteron 46.7 TF, Cell 1410 TF
       • TSUBAME (total 163.2 TF): Opteron 49.8 TF, Xeon 7.3 TF, ClearSpeed 52.2 TF, Tesla 53.9 TF

  17. Where are matrix data placed? (1)
     • A Roadrunner node: 16 GB host memory for the CPUs; Cell accelerators with 4 GB device memory each
       • Host mem vs. device mem: 16 GB = 4 GB x 4
     • A TSUBAME node: 32 GB host memory for the CPUs; Tesla device memory of 4 GB x 2 and ClearSpeed device memory of 1 GB
       • Host mem vs. device mem: 32 GB > 4 GB x 2 + 1 GB

  18. Where are matrix data placed? (2)
     • In Linpack, the matrix should be as large as possible to gain speed in Flops, so it should be nearly as large as host memory (a back-of-the-envelope sketch follows this slide)
     • On Roadrunner: (1) device memory equals host memory, and (2) kernel computation is done only by the Cells, so matrix data are kept in Cell device memory
     • On TSUBAME: device memory is smaller than host memory, so matrix data usually stay in host memory
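A rough sketch of the constraint behind "as large as host memory": the dense matrix needs 8·N·N bytes spread over all nodes. The node count and per-node memory follow earlier slides; the 80% usable fraction is an assumption, not a figure from the paper.

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    double nodes     = 655;
    double mem_node  = 32.0 * 1024 * 1024 * 1024;  /* 32 GB per node */
    double usable    = 0.8;                        /* assumed: leave room for OS, buffers */
    double total_mem = nodes * mem_node * usable;

    double n_max = sqrt(total_mem / sizeof(double)); /* from 8*N*N <= usable memory */
    printf("N can be at most about %.2g\n", n_max);  /* on the order of 10^6 */
    return 0;
}
```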

  19. Executing Kernel Functions on Accelerators
     • Matrix data are in host memory when the DGEMM function is called
     • Pipelined DGEMM execution (a sketch follows this slide):
       • (1) A part of the input data is moved from host to device via PCI-e/PCI-X
       • (2) DGEMM is computed on the accelerator
       • (3) The results are moved back to the host; then repeat for the next partial matrix
     • More frequent and larger PCI-e/PCI-X communication is required than on Roadrunner
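Below is a hypothetical reconstruction of such a pipelined DGEMM in C, using the CUDA runtime and cuBLAS host APIs: column slabs of B and C are streamed through the GPU on two CUDA streams so that PCI-Express transfers overlap with compute. It sketches the idea on the slide, not the authors' actual code (which also drives ClearSpeed boards and CPU BLAS).

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* C = C - A*B, with A, B, C resident in host memory.  Column-major layout;
 * error checking omitted for brevity. */
static void pipelined_dgemm(cublasHandle_t h, int m, int n, int k, int slab,
                            const double *A, const double *B, double *C)
{
    const double alpha = -1.0, beta = 1.0;   /* trailing-update form, as in HPL */
    double *dA, *dB[2], *dC[2];
    cudaStream_t s[2];

    cudaMalloc((void **)&dA, sizeof(double) * m * k);
    cudaMemcpy(dA, A, sizeof(double) * m * k, cudaMemcpyHostToDevice);
    for (int i = 0; i < 2; i++) {
        cudaStreamCreate(&s[i]);
        cudaMalloc((void **)&dB[i], sizeof(double) * k * slab);
        cudaMalloc((void **)&dC[i], sizeof(double) * m * slab);
    }

    for (int j = 0, it = 0; j < n; j += slab, it++) {
        int w   = (j + slab <= n) ? slab : n - j;
        int buf = it % 2;
        cudaStreamSynchronize(s[buf]);       /* buffer from two iterations ago is free */

        /* (1) move the next slab of the input to the device */
        cudaMemcpyAsync(dB[buf], B + (size_t)j * k, sizeof(double) * k * w,
                        cudaMemcpyHostToDevice, s[buf]);
        cudaMemcpyAsync(dC[buf], C + (size_t)j * m, sizeof(double) * m * w,
                        cudaMemcpyHostToDevice, s[buf]);

        /* (2) DGEMM on the accelerator for this slab */
        cublasSetStream(h, s[buf]);
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, w, k,
                    &alpha, dA, m, dB[buf], k, &beta, dC[buf], m);

        /* (3) move the result back; the other stream keeps the GPU busy meanwhile */
        cudaMemcpyAsync(C + (size_t)j * m, dC[buf], sizeof(double) * m * w,
                        cudaMemcpyDeviceToHost, s[buf]);
    }
    cudaDeviceSynchronize();
    cudaFree(dA);
    for (int i = 0; i < 2; i++) {
        cudaFree(dB[i]); cudaFree(dC[i]); cudaStreamDestroy(s[i]);
    }
}

int main(void)
{
    int m = 1024, n = 1024, k = 256, slab = 256;
    double *A, *B, *C;
    /* pinned host buffers so cudaMemcpyAsync can actually overlap with compute */
    cudaMallocHost((void **)&A, sizeof(double) * m * k);
    cudaMallocHost((void **)&B, sizeof(double) * k * n);
    cudaMallocHost((void **)&C, sizeof(double) * m * n);
    for (int i = 0; i < m * k; i++) A[i] = 1.0;
    for (int i = 0; i < k * n; i++) B[i] = 1.0;
    for (int i = 0; i < m * n; i++) C[i] = (double)k;

    cublasHandle_t h;
    cublasCreate(&h);
    pipelined_dgemm(h, m, n, k, slab, A, B, C);
    printf("C[0] = %g (expected 0)\n", C[0]);
    cublasDestroy(h);
    cudaFreeHost(A); cudaFreeHost(B); cudaFreeHost(C);
    return 0;
}
```

With pinned host buffers, the transfer for slab i+1 can proceed while slab i is being multiplied, which is what hides part of the PCI-Express cost the slide points out.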

  20. Challenging Issues on TSUBAME
     • Intra-node heterogeneity: CPUs, GPUs and ClearSpeed are all used for the kernel (on Roadrunner, using only the Cells is sufficient)
     • Inter-node heterogeneity: half the nodes have GPUs, while the others don't (on Roadrunner, the nodes are uniform)
     • Frequent PCI-e/PCI-X communication: the whole input/output is moved via PCI (on Roadrunner, matrix data always reside in Cell device memory)
     • How can we run HPL, originally designed for uniform systems, efficiently?

  21. Coping with Intra-node Heterogeneity
     • We 'virtualize' the heterogeneous processors at the BLAS layer: processors are providers of DGEMM performance
     • We control the mapping between processes and processors (a minimal split sketch follows this slide):
       • An MPI process divides its own sub-matrix with a proper ratio and throws DGEMM tasks to CPUs and accelerators
       • All processes should be mapped to sets of processors with similar performance
     • Figure: example of the mapping between processes and processors during DGEMM
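A minimal sketch of the ratio-based split behind the "virtualized BLAS" idea: the column range of a local DGEMM is divided in proportion to each processor's DGEMM speed. The processor list and GFlops figures are placeholders, not measurements from the paper.

```c
#include <stdio.h>

#define NPROC 3

int main(void)
{
    const char *name[NPROC]   = { "Opteron cores", "Tesla GPUs", "ClearSpeed" };
    double      gflops[NPROC] = { 60.0, 70.0, 60.0 };  /* assumed per-node DGEMM rates */
    int n = 10000;                                     /* columns of the local DGEMM */

    double total = 0.0;
    for (int i = 0; i < NPROC; i++) total += gflops[i];

    int start = 0;
    for (int i = 0; i < NPROC; i++) {
        /* last processor takes the remainder so the split covers all columns */
        int cols = (i == NPROC - 1) ? n - start : (int)(n * gflops[i] / total);
        printf("%-14s gets columns [%d, %d)\n", name[i], start, start + cols);
        /* a real implementation would enqueue dgemm(A, B[:, start:start+cols], ...) here */
        start += cols;
    }
    return 0;
}
```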
