
Easy Programming of Linear Algebra Operations on Hybrid CPU-GPU Platforms

HPC & A — Easy Programming of Linear Algebra Operations on Hybrid CPU-GPU Platforms. Enrique S. Quintana-Ortí. INRIA-Sophia Antipolis, June 2011. Index: the libflame library · GPU support · the StarSs framework.


  1. libflame → Creating your own algorithm → FLAME notation and algorithms — Algorithm loop: repartition + operation + merging. The 2x2 partitioning ATL / ABL, ABR (only the lower triangle of the symmetric A is referenced) is repartitioned into A00 / a10^T, alpha11 / A20, a21, A22; the updates alpha11 := sqrt(alpha11), a21 := a21 / alpha11, A22 := A22 - a21*a21^T are applied; and the quadrants are merged back.

  2. libflame → Creating your own algorithm → FLAME notation and algorithms — Algorithm loop: repartition. ATL / ABL, ABR → A00 / a10^T, alpha11 / A20, a21, A22. Indexing operations.

  3. libflame → Creating your own algorithm → FLAME notation and algorithms — Algorithm loop: operation. alpha11 := sqrt(alpha11); a21 := a21 / alpha11; A22 := A22 - a21*a21^T. Real computation.

  4. libflame → Creating your own algorithm → FLAME notation and algorithms — Algorithm loop: merging. The updated A00 / a10^T, alpha11 / A20, a21, A22 is merged back into ATL / ABL, ABR. Indexing operation.

  5. libflame → Creating your own algorithm → FLAME notation and algorithms — Algorithm: automatic development from the mathematical specification A = L * L^T; a mechanical procedure.
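  For reference, the mechanical procedure can be sketched as follows (a worked derivation added here, not part of the original slide): partition A = L L^T, equate the corresponding quadrants, and read off the loop updates of slides 1-4.

  \[
  \begin{pmatrix} \alpha_{11} & \star \\ a_{21} & A_{22} \end{pmatrix}
  =
  \begin{pmatrix} \lambda_{11} & 0 \\ l_{21} & L_{22} \end{pmatrix}
  \begin{pmatrix} \lambda_{11} & l_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
  \;\Rightarrow\;
  \alpha_{11} := \sqrt{\alpha_{11}}, \quad
  a_{21} := a_{21}/\alpha_{11}, \quad
  A_{22} := A_{22} - a_{21} a_{21}^T .
  \]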

  6. libflame → Creating your own algorithm → Spark: from algorithm to code — APIs: Spark + APIs for C, F77, Matlab, LabView, LaTeX.

  7. libflame → Creating your own algorithm → Spark: from algorithm to code — Spark website: http://www.cs.utexas.edu/users/flame/Spark/

  8. libflame → Creating your own algorithm → Spark: from algorithm to code — Example: FLAME@lab (indexing operations):

    [ ATL, ATR, ...
      ABL, ABR ] = FLA_Part_2x2( A, 0, 0, 'FLA_TL' );
    while ( size( ATL, 1 ) < size( A, 1 ) )
      [ A00,  a01,     A02, ...
        a10t, alpha11, a12t, ...
        A20,  a21,     A22 ] = FLA_Repart_2x2_to_3x3( ATL, ATR, ...
                                                      ABL, ABR, ...
                                                      1, 1, 'FLA_BR' );
      %----------------------------------------%
      %  :                                      %
      %----------------------------------------%
      [ ATL, ATR, ...
        ABL, ABR ] = FLA_Cont_with_3x3_to_2x2( A00,  a01,     A02, ...
                                               a10t, alpha11, a12t, ...
                                               A20,  a21,     A22, ...
                                               'FLA_TL' );
    end

  9. libflame → Creating your own algorithm → Spark: from algorithm to code — Example: FLAME@lab; manually fill in the operations (real computation):

    [...] = FLA_Part_2x2(...);
    while ( size( ATL, 1 ) < size( A, 1 ) )
      [...] = FLA_Repart_2x2_to_3x3(...);
      %----------------------------------------%
      alpha11 = sqrt( alpha11 );
      a21     = a21 / alpha11;
      A22     = A22 - tril( a21*a21' );
      %----------------------------------------%
      [...] = FLA_Cont_with_3x3_to_2x2(...);
    end

  10. libflame → Creating your own algorithm → Running on multicore — Example: FLAMEC (indexing skeleton):

    FLA_Part_2x2( A,    &ATL, &ATR,
                        &ABL, &ABR,     0, 0, FLA_TL );
    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){
      b = min( FLA_Obj_length( ABR ), nb_alg );
      FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00,  /**/ &a01,     &A02,
                          /* ************* */ /* ************************ */
                                               &a10t, /**/ &alpha11, &a12t,
                             ABL, /**/ ABR,    &A20,  /**/ &a21,     &A22,
                             1, 1, FLA_BR );
      /*--------------------------------------*/
      /*  :                                    */
      /*--------------------------------------*/
      FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,    A00,  a01,     /**/ A02,
                                                    a10t, alpha11, /**/ a12t,
                             /* ************** */ /* ********************** */
                                &ABL, /**/ &ABR,    A20,  a21,     /**/ A22,
                                FLA_TL );
    }

  11. libflame → Creating your own algorithm → Running on multicore — Example: FLAMEC; libflame employs an external BLAS: GotoBLAS, MKL, ACML, ATLAS, netlib.

    FLA_Part_2x2(...);
    while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
      FLA_Repart_2x2_to_3x3(...);
      /*--------------------------------------*/
      FLA_Sqrt( alpha11 );
      FLA_Inv_scal( alpha11, a21 );
      FLA_Syr( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
               FLA_MINUS_ONE, a21, A22 );
      /*--------------------------------------*/
      FLA_Cont_with_3x3_to_2x2(...);
    }

  12. Index — The libflame library: 1. A user's view; 2. Creating your own algorithm; 3. FLAME runtime (1. Task parallelism, 2. SuperMatrix, 3. GPU support); 4. Clusters of GPUs. The SMPSs/GPUSs framework.

  13. Data-flow parallelism? Dynamic scheduling? Run-time? — Surely not a new idea: Cilk, StarSs (GridSs), StarPU, ...; "An Efficient Algorithm for Exploiting Multiple Arithmetic Units", R. M. Tomasulo, IBM J. of R&D, Volume 11, Number 1, Page 25 (1967) — the basis for the exploitation of ILP in current superscalar processors!

  14. The TEXT project — Towards Exaflop applicaTions. Demonstrate that hybrid MPI/SMPSs addresses the Exascale challenges in a productive and efficient way. Deploy at supercomputing centers: Jülich, EPCC, HLRS, BSC. Port applications (HLA, SPECFEM3D, PEPC, PSC, BEST, CPMD, LS1 MarDyn) and develop algorithms. Develop additional environment capabilities: tools (debug, performance) and improvements in runtime systems (load balance and GPUSs). Support other users: identify users of TEXT applications; identify and support interested application developers. Contribute to standards (OpenMP ARB, PERI-XML).

  15. libflame → FLAME runtime → Task parallelism — Blocked algorithms: Cholesky factorization. Per iteration: A11 = L11 * L11^T (factor the diagonal block), A21 := L21 = A21 * L11^-T (triangular solve), A22 := A22 - L21 * L21^T (symmetric rank-k update).
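  The same derivation at block level (a sketch added for clarity, not part of the deck) shows where these three per-iteration operations come from:

  \[
  \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
  =
  \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
  \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
  \;\Rightarrow\;
  A_{11} = L_{11} L_{11}^T, \quad
  L_{21} = A_{21} L_{11}^{-T}, \quad
  A_{22} := A_{22} - L_{21} L_{21}^T .
  \]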

  16. libflame → FLAME runtime → Task parallelism — Blocked algorithms: Cholesky factorization, A = L * L^T (APIs + tools):

    FLA_Part_2x2(...);
    while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
      FLA_Repart_2x2_to_3x3(...);
      /*--------------------------------------*/
      FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
      FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*--------------------------------------*/
      FLA_Cont_with_3x3_to_2x2(...);
    }

  17. libflame → FLAME runtime → Task parallelism — Blocked algorithms: simple parallelization. Link with a multithreaded (MT) BLAS so that the per-iteration updates (A11 = L11 * L11^T; A21 := L21 = A21 * L11^-T; A22 := A22 - L21 * L21^T) exploit the cores inside each BLAS call; the loop body is the same FLA_Chol / FLA_Trsm / FLA_Syrk code as in slide 16.

  18. libflame → FLAME runtime → Task parallelism — Blocked algorithms: there is more parallelism! Both inside the same iteration and across different iterations, between tasks that operate on independent blocks.

  19. libflame → FLAME runtime → SuperMatrix — Exploiting task-level parallelism. SuperMatrix: automatic identification of tasks and dependencies, building a DAG (tasks 1-10 in the example figure).

  20. libflame → FLAME runtime → SuperMatrix — Exploiting task-level parallelism. SuperMatrix: automatic identification of tasks/dependencies — how? The input/output/input-output operands and the order of the operations in the code determine the dependencies, and the direction of the operands is defined as part of the BLAS specification. For the Cholesky loop body this covers FLA_Chol( FLA_LOWER_TRIANGULAR, A11 ), FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A11, A21 ) and FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, FLA_ONE, A22 ).
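  To make the mechanism concrete, here is a minimal sketch (hypothetical code, not the SuperMatrix implementation; all names are invented) of how a runtime can derive flow dependences from operand directions alone, using a per-block record of the last writer. Running it for one Cholesky iteration reports that TRSM depends on CHOL and SYRK depends on TRSM, matching the DAG in the slides.

    #include <stdio.h>

    #define MAX_BLOCKS 16

    /* Last task that wrote each block (-1 if none yet). */
    static int last_writer[MAX_BLOCKS];

    /* Register a task that reads blocks `in` and writes blocks `out`,
       printing the flow (read-after-write) dependences it creates. */
    static void submit_task(int id, const char *name,
                            const int *in, int n_in,
                            const int *out, int n_out) {
      for (int i = 0; i < n_in; i++)
        if (last_writer[in[i]] >= 0)
          printf("task %d (%s) depends on task %d (block %d)\n",
                 id, name, last_writer[in[i]], in[i]);
      for (int i = 0; i < n_out; i++)
        last_writer[out[i]] = id;   /* this task becomes the producer of its outputs */
    }

    int main(void) {
      for (int b = 0; b < MAX_BLOCKS; b++) last_writer[b] = -1;

      /* Blocks touched in one Cholesky iteration: 0 = A11, 1 = A21, 2 = A22. */
      int chol_in[] = {0},    chol_out[] = {0};   /* CHOL: inout A11         */
      int trsm_in[] = {0, 1}, trsm_out[] = {1};   /* TRSM: in A11, inout A21 */
      int syrk_in[] = {1, 2}, syrk_out[] = {2};   /* SYRK: in A21, inout A22 */

      submit_task(0, "CHOL", chol_in, 1, chol_out, 1);
      submit_task(1, "TRSM", trsm_in, 2, trsm_out, 1);
      submit_task(2, "SYRK", syrk_in, 2, syrk_out, 1);
      return 0;
    }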

  21. libflame → FLAME runtime → SuperMatrix — Exploiting task-level parallelism. SuperMatrix: scheduling of the tasks (1-10 in the DAG) to the cores.

  22. libflame → FLAME runtime → SuperMatrix — Exploiting task-level parallelism. SuperMatrix: scheduling of tasks to cores — how? A list of ready tasks and one thread per core, in three variants: 1. a centralized list; 2. one list per thread; 3. one list per thread with work stealing.
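  Variant 3 can be pictured with a tiny single-threaded simulation (an illustrative sketch only; the actual runtime uses one thread per core with proper synchronization): each worker pops from its own ready list and steals from another list when its own is empty.

    #include <stdio.h>

    #define NWORKERS 2
    #define CAP 8

    /* One ready list per worker (variants 2/3 above); a single shared list would be variant 1. */
    static int list[NWORKERS][CAP];
    static int count[NWORKERS];

    static int pop_or_steal(int me) {
      if (count[me] > 0) return list[me][--count[me]];          /* own work first         */
      for (int v = 0; v < NWORKERS; v++)                        /* otherwise steal a task */
        if (v != me && count[v] > 0) return list[v][--count[v]];
      return -1;                                                /* nothing ready anywhere */
    }

    int main(void) {
      list[0][0] = 1; list[0][1] = 2; list[0][2] = 4; count[0] = 3;
      list[1][0] = 3;                                 count[1] = 1;
      /* Round-robin "execution": worker 1 quickly runs out and steals from worker 0. */
      for (int step = 0, t; step < 6; step++) {
        int w = step % NWORKERS;
        if ((t = pop_or_steal(w)) >= 0)
          printf("worker %d executes task %d\n", w, t);
      }
      return 0;
    }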

  23. libflame → FLAME runtime → GPU support — Single GPU. SuperMatrix: dealing with data transfers between the host (CPU) and device (GPU) memory spaces while executing the task DAG.

  24. libflame → FLAME runtime → GPU support — Single GPU: a user's view.

    FLA_Obj A;
    // Initialize conventional matrix: buffer, m, rs, cs
    // Obtain storage blocksize, # of threads: b, n_threads
    FLA_Init();
    FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
    FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );
    FLASH_Queue_set_num_threads( n_threads );
    FLASH_Queue_enable_gpu();
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
    FLASH_Obj_free( &A );
    FLA_Finalize();

  25. libflame → FLAME runtime → GPU support — Single GPU: under the cover. Indexing operations (now with addresses in device memory):

    FLA_Part_2x2(...);
    while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
      FLA_Repart_2x2_to_3x3(...);
      /*--------------------------------------*/
      FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
      FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                  FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                  FLA_ONE, A11, A21 );
      FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                  FLA_MINUS_ONE, A21, FLA_ONE, A22 );
      /*--------------------------------------*/
      FLA_Cont_with_3x3_to_2x2(...);
    }

  26. libflame → FLAME runtime → GPU support — Single GPU: under the cover (same FLASH Cholesky loop as in slide 25). Real computation: the SuperMatrix runtime keeps track of the data in host/device memory and performs the necessary transfers, reducing the number of copies.

  27. libflame → FLAME runtime → GPU support — Single GPU: under the cover (same FLASH Cholesky loop). Step 1: copy the matrix to GPU memory before the algorithm commences.

  28. libflame → FLAME runtime → GPU support — Single GPU: under the cover (same FLASH Cholesky loop). Step 2: copy block A11 from device to host before its factorization.

  29. libflame → FLAME runtime → GPU support — Single GPU: under the cover (same FLASH Cholesky loop). Step 3: copy block A11 from host to device before using it in subsequent computations.
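  Steps 1-3 can be summarized with a per-block validity flag on each side of the PCI-e bus (a hypothetical sketch; the actual runtime bookkeeping is richer): a transfer is issued only when the side that needs the block does not hold a valid copy.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical per-block descriptor: where the up-to-date copy currently lives. */
    typedef struct { bool valid_on_device; bool valid_on_host; } block_t;

    /* Called before a CPU task (e.g. the small Cholesky of A11) touches a block. */
    static void need_on_host(block_t *b, const char *name) {
      if (!b->valid_on_host) { printf("copy %s device -> host\n", name); b->valid_on_host = true; }
    }

    /* Called before a GPU task (e.g. TRSM/SYRK) touches a block. */
    static void need_on_device(block_t *b, const char *name) {
      if (!b->valid_on_device) { printf("copy %s host -> device\n", name); b->valid_on_device = true; }
    }

    int main(void) {
      block_t A11 = { true, false };  /* step 1: whole matrix copied to the GPU up front   */
      need_on_host(&A11, "A11");      /* step 2: factor A11 on the CPU                     */
      A11.valid_on_device = false;    /* ...the factorization makes the device copy stale  */
      need_on_device(&A11, "A11");    /* step 3: push A11 back before TRSM uses it         */
      return 0;
    }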

  30. libflame → FLAME runtime → GPU support — Multi-GPU. SuperMatrix: dealing with data transfers between the host (CPU) and the device (GPU) memory spaces while executing the task DAG.

  31. libflame → FLAME runtime → GPU support — Multi-GPU: how do we program these? A node with CPU(s) and GPUs #0-#3 attached through the PCI-e bus, plus the interconnect between nodes.

  32. libflame → FLAME runtime → GPU support — Multi-GPU: a user's view. Exactly the same driver code as in the single-GPU case (slide 24): create the hierarchical object, set the number of threads, call FLASH_Queue_enable_gpu() and invoke FLASH_Chol( FLA_LOWER_TRIANGULAR, A ).

  33. libflame → FLAME runtime → GPU support — Multi-GPU: under the cover. Naïve approach: keep the data in host memory and, around every task executed on a GPU, copy its operands in and its results back across the PCI-e bus → an excessive number of data transfers.

  34. libflame → FLAME runtime → GPU support — Multi-GPU: under the cover. How do we program these? View the CPU(s) + multiple GPUs as a shared-memory multiprocessor plus a (software) distributed shared memory (DSM).

  35. libflame → FLAME runtime → GPU support — Multi-GPU: under the cover. View the system as a shared-memory multiprocessor (a multi-core processor with hardware coherence): MP = P0+C0, P1+C1, P2+C2, P3+C3, i.e., each GPU plays the role of a processor with its own cache.

  36. libflame → FLAME runtime → GPU support — Multi-GPU: under the cover. Software Distributed Shared Memory (DSM): software offers flexibility at the cost of efficiency; the underlying distributed memory is hidden from the users; memory transfers are reduced using write-back, write-invalidate, etc. A well-known approach that is not too efficient as middleware for general applications, but the regularity of dense linear algebra operations makes a difference!

  37. libflame → FLAME runtime → GPU support — Multi-GPU: under the cover. Reduce the number of data transfers: SuperMatrix combines a static distribution of the blocks among the GPU memories with owner-computes execution, software caching of blocks in the GPU memories, and write-back/write-update policies between host and devices (detailed in the next slides).

  38. libflame → FLAME runtime → GPU support — Multi-GPU: under the cover (same FLASH Cholesky loop as in slide 25). Step 1: distribute the matrix among the GPU memories (2D workload distribution) before the algorithm commences.

  39. libflame → FLAME runtime → GPU support — Multi-GPU: under the cover. Step 1 (continued): distribute the matrix among the memories of GPU #0-#3 (2D workload distribution) and apply the owner-computes rule.
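  As an illustration of a 2D distribution with the owner-computes rule (the concrete mapping below is made up; the deck does not specify one), block (i, j) could be assigned to the GPU in row i mod 2, column j mod 2 of a 2x2 grid of GPUs, and every task that overwrites the block then runs on, and leaves its result in, that GPU.

    #include <stdio.h>

    /* 2D block-cyclic owner for a 2x2 grid of GPUs (illustrative mapping only). */
    static int owner(int i, int j) { return (i % 2) * 2 + (j % 2); }

    int main(void) {
      int nb = 4;                            /* 4 x 4 blocks of the matrix        */
      for (int i = 0; i < nb; i++) {         /* print which GPU owns each block   */
        for (int j = 0; j <= i; j++)         /* lower triangle only (symmetric A) */
          printf("A(%d,%d) -> GPU #%d   ", i, j, owner(i, j));
        printf("\n");
      }
      return 0;
    }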

  40. libflame → FLAME runtime → GPU support — Multi-GPU: under the cover (same FLASH Cholesky loop). Step 2: copy block A11 from the corresponding device to the host before its factorization.

  41. libflame → FLAME runtime → GPU support — Multi-GPU: under the cover (same FLASH Cholesky loop). Step 3: broadcast block A11 from the host to the appropriate devices before using it in subsequent computations (write-update).

  42. libflame → FLAME runtime → GPU support — Multi-GPU: under the cover (same FLASH Cholesky loop). Step 4: keep A11 in the receiving device(s) in case it is needed in subsequent computations (cache).

  43. libflame → FLAME runtime → GPU support — Multi-GPU: under the cover (same FLASH Cholesky loop). Step 5: keep the updated A21 in its device until it is replaced (write-back).

  44. libflame → FLAME runtime → GPU support — Performance (figure).

  45. libflame → FLAME runtime → GPU support — Performance (figure).

  46. libflame → FLAME runtime → GPU support — Performance (figure).

  47. Index — The libflame library: 1. A user's view; 2. Creating your own algorithm; 3. FLAME runtime; 4. Clusters of GPUs (1. DLA for clusters, 2. Host-centric view, 3. Device-centric view). The StarSs framework.

  48. libflame → Clusters of GPUs → DLA for clusters — libflame-like libraries. PLAPACK (UT@Austin): use of objects (PLA_Obj) — vectors, matrices, projected vectors, etc. — with the layout embedded; PMB distribution; layered and modular design in which all communication is done via copies (PLA_Copy) and reductions (PLA_Reduce) from one object type to another. Elemental (Jack Poulson): based on PLAPACK, but C++; element-wise cyclic data layout.

  49. libflame → Clusters of GPUs → Host-centric view — Data in host memory. Before executing a kernel, copy the input data to GPU memory; after execution, retrieve the results back to the node's main memory. Easy to program (wrappers to the kernels), but the copies are linked to kernel execution: O(n^3) transfers between CPU and GPU.

  50. libflame → Clusters of GPUs → Device-centric view — Data in GPU memory. Before sending a piece of data, retrieve it back to the node's main memory (compacting it on the fly); after reception, copy its contents to GPU memory. Easy to program (wrappers to the MPI calls), and the copies are linked to communication rather than to kernel execution: O(n^2) transfers between CPU and GPU.
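  The difference between the two views is only where the PCI-e copies are attached (a schematic sketch with stand-in functions that just trace the traffic, not the libraries' actual code): in the host-centric scheme every kernel is bracketed by copies, while in the device-centric scheme copies appear only around the inter-node messages.

    #include <stdio.h>

    /* Stand-ins for the real operations (hypothetical; they only print the traffic). */
    static void copy_host_to_gpu(const char *blk) { printf("  PCI-e: %s host -> GPU\n", blk); }
    static void copy_gpu_to_host(const char *blk) { printf("  PCI-e: %s GPU -> host\n", blk); }
    static void gpu_kernel(const char *op)        { printf("  GPU kernel: %s\n", op); }
    static void mpi_exchange(const char *blk)     { printf("  network: exchange %s\n", blk); }

    int main(void) {
      int iters = 2;   /* stand-in for the O(n/b) iterations of a blocked factorization */

      printf("Host-centric: copies tied to every kernel -> O(n^3) PCI-e traffic\n");
      for (int k = 0; k < iters; k++) {
        copy_host_to_gpu("operands");  gpu_kernel("update trailing submatrix");
        copy_gpu_to_host("results");   mpi_exchange("panel");
      }

      printf("Device-centric: data lives on the GPU; copies tied to communication -> O(n^2)\n");
      for (int k = 0; k < iters; k++) {
        gpu_kernel("update trailing submatrix");             /* operands already resident */
        copy_gpu_to_host("panel");  mpi_exchange("panel");  copy_host_to_gpu("panel");
      }
      return 0;
    }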

  51. libflame → Clusters of GPUs — Performance (figure; speedups of 22x, 10x and 5x are highlighted).

  52. libflame → Clusters of GPUs — Performance (figure).

  53. Acknowledgements — Funding sources.

  54. Further information — Contact: field@cs.utexas.edu. FLAME project website: www.cs.utexas.edu/users/flame/. libflame: The Complete Reference: www.cs.utexas.edu/users/field/docs/ (updated nightly) and www.lulu.com/content/5915632 (updated occasionally).

  55. Index — The libflame library · GPU support · The StarSs framework.

  56. The TEXT project — same content as slide 14: Towards Exaflop applicaTions; demonstrate that hybrid MPI/SMPSs addresses the Exascale challenges in a productive and efficient way; deployment at Jülich, EPCC, HLRS and BSC; application porting; tool and runtime improvements; support for other users; contributions to standards (OpenMP ARB, PERI-XML).

  57. Index — The libflame library. The StarSs framework: 1. StarSs overview; 2. OmpSs. Slides from Rosa M. Badia, Barcelona Supercomputing Center — thanks!

  58. StarSs → StarSs overview — Programming model. A sequential application whose task parameters are annotated with their direction (input, output, inout), e.g.

    for (i=0; i<N; i++){
      T1 ( data1, data2);
      T2 ( data4, data5);
      T3 ( data2, data5, data6);
      T4 ( data7, data8);
      T5 ( data6, data8, data9);
    }

  The runtime creates the task graph from the data precedences and takes care of task selection, synchronization, scheduling, data transfer, results transfer and task execution on the parallel resources (multicore, SMP, cluster, cloud, grid; resources 1..N).

  59. StarSs → StarSs overview — Programming model. The StarSs family: GridSs, CellSs, SMPSs, GPUSs, ClusterSs, ClearSpeedSs, OmpSs, COMPSs (targeting @ SMP, @ GPU, @ Cluster). Programmability/portability: incremental parallelization/restructuring; StarSs separates the algorithm from the resources; disciplined programming; a "node"-level programming model; sequential C/Fortran/Java + annotations; the "same" source code runs on "any" machine; optimized task implementations will result in better performance; task based, with asynchrony and data flow. Performance: a "simple" linear address space; an intelligent runtime with directionality annotations on the task arguments automatically extracts and exploits parallelism (dataflow, workflow); it nicely integrates in hybrid MPI/StarSs; it matches computations to specific resources on each type of target platform; natural support for heterogeneity; asynchronous (data-flow) execution and locality awareness.

  60. StarSs → StarSs overview — A sequential program…

    void vadd3 (float A[BS], float B[BS], float C[BS]);
    void scale_add (float sum, float A[BS], float B[BS]);
    void accum (float A[BS], float *sum);

    for (i=0; i<N; i+=BS)   // C=A+B
      vadd3 ( &A[i], &B[i], &C[i]);
    ...
    for (i=0; i<N; i+=BS)   // sum(C[i])
      accum (&C[i], &sum);
    ...
    for (i=0; i<N; i+=BS)   // B=sum*A
      scale_add (sum, &E[i], &B[i]);
    ...
    for (i=0; i<N; i+=BS)   // A=C+D
      vadd3 (&C[i], &D[i], &A[i]);
    ...
    for (i=0; i<N; i+=BS)   // E=G+F
      vadd3 (&G[i], &F[i], &E[i]);

  61. StarSs → StarSs overview — A sequential program… taskified… Dependences are computed at task-instantiation time (the colors/numbers 1-20 in the graph give the order of task instantiation; some antidependences covered by flow dependences are not drawn):

    #pragma css task input(A, B) output(C)
    void vadd3 (float A[BS], float B[BS], float C[BS]);
    #pragma css task input(sum, A) inout(B)
    void scale_add (float sum, float A[BS], float B[BS]);
    #pragma css task input(A) inout(sum)
    void accum (float A[BS], float *sum);

  followed by the same five loops as in slide 60.

  62. StarSs → StarSs overview — A sequential program… taskified… with data-flow execution. Decouple how we write the program from how it is executed: the same annotated code as in slide 61, but the colors/numbers now give a possible order of task execution rather than of instantiation.

  63. StarSs → StarSs overview — The potential of data-access information. Flexibility to dynamically traverse the dataflow graph, "optimizing" for concurrency (critical path) and memory accesses (data transfers performed by the runtime). Opportunities for prefetch, reuse, elimination of antidependences (renaming) and replication management, with coherency/consistency handled by the runtime.
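  One of the opportunities listed above, renaming to eliminate antidependences, can be sketched as follows (illustrative code only, not StarSs internals): when a later task wants to overwrite data that an earlier, still-pending task reads, the runtime can hand the writer a fresh copy so that both tasks may proceed independently.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BS 4

    /* Earlier task still reads the original version of B ... */
    static void reader(const float *B)  { printf("reader sees  B[0] = %g\n", B[0]); }
    /* ... while a later task wants to overwrite B. */
    static void writer(float *B)        { B[0] = 42.0f; }

    int main(void) {
      float B[BS] = {1, 2, 3, 4};

      /* Renaming: the writer gets a private copy, so it need not wait for the reader
         (a runtime would later install the new version as the current one). */
      float *B_next = malloc(sizeof B);
      memcpy(B_next, B, sizeof B);

      writer(B_next);   /* may run immediately         */
      reader(B);        /* still sees the old value, 1 */

      printf("after rename, B[0] = %g\n", B_next[0]);
      free(B_next);
      return 0;
    }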

  64. Index — The libflame library. The StarSs framework: 1. StarSs overview; 2. OmpSs (1. Overview & syntax, 2. Compiler, 3. Runtime, 4. Examples).

  65. StarSs → OmpSs → Overview & syntax — OmpSs = OpenMP + StarSs extensions. OmpSs is based on OpenMP with some differences: a different execution model, an extended memory model, extensions for point-to-point inter-task synchronizations (data dependencies), extensions for heterogeneity, and other minor extensions.

  66. StarSs → OmpSs → Overview & syntax — Execution model: thread-pool model. The OpenMP parallel construct is "ignored"; all threads are created on startup; one of them starts executing main; all get work from a task pool and can generate new work.

  67. StarSs → OmpSs → Overview & syntax — Memory model. Two "modes" are allowed: pure SMP (a single address space; the standard OpenMP memory model is used) and non-SMP (cluster, GPUs, ...), where multiple address spaces exist, the same data may exist in several of them, and data consistency is ensured by the implementation.

  68. StarSs → OmpSs → Overview & syntax — Main element: Task, the unit of computation. Task definition: pragmas inlined or pragmas attached to a function definition.

    #pragma omp task
    void foo (int Y[size], int size)
    {
      int j;
      for (j=0; j<size; j++) Y[j]= j;
    }

    int main()
    {
      int X[100];
      foo (X, 100);
    }

  69. StarSs → OmpSs → Overview & syntax — Defining dependences. Clauses that express the data direction: input, output, inout. Dependences are computed at runtime taking these clauses into account (the numbers 1-4 on the slide mark the tasks in the resulting dependence graph):

    #pragma omp task output( x )
    x = 5;
    #pragma omp task input( x )
    printf("%d\n", x );
    #pragma omp task inout( x )
    x++;
    #pragma omp task input( x )
    printf("%d\n", x );

  70. StarSs → OmpSs → Overview & syntax — Heterogeneity: the target directive. Directive to specify device-specific information: #pragma omp target [ clauses ]. Clauses: device (which device: smp, gpu); copy_in, copy_out, copy_inout (data to be moved in and out); implements (specifies alternate implementations).

    #pragma omp target device (smp)
    #pragma omp task input (Y)
    void foo (int Y[size], int size)
    {
      int j;
      for (j=0; j<size; j++) Y[j]= j;
    }

    int main()
    {
      int X[100];
      foo (X, 100);
    }

  71. StarSs → OmpSs → Overview & syntax — Synchronization: #pragma omp taskwait suspends the current task until all children tasks are completed (just the direct children, not all descendants).

    void traverse_list ( List l )
    {
      Element e;
      for ( e = l->first; e; e = e->next )
        #pragma omp task
        process ( e );
      #pragma omp taskwait
    }

  72. StarSs → OmpSs → Overview & syntax — Hierarchical task graph: nesting.

    #pragma omp task input([BS][BS]A, [BS][BS]B) \
                     inout([BS][BS]C)
    void small_dgemm(float *C, float *A, float *B);

    #pragma omp task input([N][N]A, [N][N]B) \
                     inout([N][N]C)
    void block_dgemm(float *C, float *A, float *B){
      int i, j, k;
      for (i=0; i< N; i+=BS)
        for (j=0; j< N; j+=BS)
          for (k=0; k< N; k+=BS)
            small_dgemm(&C[i][j], &A[i][k], &B[k][j]);
    }

    main() {
      ...
      block_dgemm(A,B,C);
      block_dgemm(D,E,F);
      #pragma omp taskwait
    }

  73. StarSs → OmpSs → Compiler — Mercurium. It plays a minor role: it recognizes the constructs and transforms them into calls to the runtime, and it manages the code restructuring for the different target devices (device-specific handlers, possibly generating code in a separate file), invoking different back-end compilers (e.g. nvcc for NVIDIA).
