

  1. DAGuE
2012 Scheduling Workshop, Pittsburgh, PA
George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, Jack Dongarra

  2. DAGuE
• DAGuE [dag] (like in Prague [prag])
• Not DAGuE like ragout [rágo͞o]
• Not DAGuE like vague [väg]
(Aside on the slide: "the Prague Astronomical Clock was first installed in 1410, making it the third-oldest astronomical clock in the world and the oldest one still working." -- Wikipedia)
• Innovative Computing Laboratory, University of Tennessee, Knoxville
• Task / Data Flow Computation Framework
• Dynamic Scheduling
• Symbolic DAG representation
• Distributed Memory
• Many-core / Accelerators

  3. Motivation
• Today software developers face systems with:
  • ~1 TFLOP of compute power per node
  • 32+ cores, 100+ hardware threads
  • Highly heterogeneous architectures (cores + specialized cores + accelerators/coprocessors)
  • Deep memory hierarchies
• Today we deal with thousands of such nodes (and plan to deal with millions)
  → systemic load imbalance / decreasing use of the resources
• How to harness these devices productively?
• SPMD produces choke points and wasted wait times
• We need to improve efficiency, power, and reliability

  4. How to Program?
• Threads & synchronization | Processes & messages
• Hand-written Pthreads, compiler-based OpenMP, Chapel, UPC, MPI, hybrid
• Very challenging to find parallelism, to debug, to maintain, and to get good performance
  • portably
  • with reasonable development effort
• When is it time to redesign a software?
• Increasing gaps between the capabilities of today's programming environments, the requirements of emerging applications, and the challenges of future parallel architectures

  5. Goals
• Decouple "system issues" from the algorithm
  • Keep the algorithm as simple as possible
• Language: depict only the flow of data between tasks
• Distributed dataflow environment based on dynamic scheduling of (micro-)tasks
• Programmability: layered approach
  • Algorithm / data distribution
  • Parallel applications without parallel programming
• Portability / efficiency
• System: use all available hardware; overlap data movements with computation; find something to do when imbalance arises

  6. Dataflow with Runtime Scheduling
• Algorithms need help to abstract:
  • Hardware specificities: a runtime can provide portability, performance, scheduling heuristics, heterogeneity management, data movement, …
  • Scalability: maximize parallelism extraction, but avoid centralized scheduling or representing the entire DAG; dynamic and independent discovery of the relevant portions during the execution
  • Jitter resilience: do not support explicit communications; instead make them implicit and schedule to maximize overlap and load balance
  → express the algorithms differently

  7. [Figure: DPOTRF, DGEQRF, and DGETRF performance, problem scaling, on 648 cores (Myrinet 10G). TFlop/s (0–7) vs. matrix size N (13k–130k); curves: theoretical peak, practical peak (GEMM), DAGuE, and DSBP [22] / HPL / ScaLAPACK respectively.]
Hardware: 81 dual Intel Xeon L5420 @ 2.5 GHz (2×4 cores/node) = 648 cores; MX 10 Gb/s; Intel MKL; ScaLAPACK.
[22] F. G. Gustavson, L. Karlsson, and B. Kågström. Distributed SBP Cholesky factorization algorithms with near-optimal scheduling. ACM Trans. Math. Softw., 36(2):1–25, 2009. ISSN 0098-3500. DOI: 10.1145/1499096.1499100.

  8. [Same performance figures as slide 7, annotated:]
• Hardware-aware scheduling
• Competes with hand-tuned codes
• Extracts more parallelism
• Change of the data layout (static task scheduling)

  9. The DAGuE Framework (layered architecture)
• Extensions: Domain Specific Extensions (Dense LA, Sparse LA, …), Tools
• Runtime: Scheduling, Parallel Data Movement, Symbolic Representation
• Hardware: Cores, Accelerators, Memory Hierarchies, Data Movement, Coherence

  10. Domain Specific Extensions
• DSEs → higher productivity for developers
• High-level data types & operations tailored to the domain (e.g., relations, matrices, triangles, …)
• Prototyping / meta-programming
• Portable and scalable specification of parallelism
• Automatically adjust data structures, mapping, and scheduling as systems scale up
• Toolkit of classical data distributions, etc.

  11. DAGuE Toolchain
[Diagram: the programmer supplies serial application code, codelets, and the data distribution. Domain Specific Extensions and the dataflow compiler turn the serial code into a dataflow representation; the system compiler produces parallel task stubs. Together with additional libraries (PLASMA, MAGMA), these feed the DAGuE runtime, which runs on the supercomputer over MPI, pthreads, and CUDA.]

  12. Serial Code to Dataflow Representation
[Same toolchain diagram as slide 11, highlighting the DAGuE compiler step: serial code → dataflow representation.]

  13. Example: QR Factorization
Tasks: GEQRT, TSQRT, UNMQR, TSMQR.

FOR k = 0 .. SIZE-1
  A[k][k], T[k][k] <- GEQRT( A[k][k] )
  FOR m = k+1 .. SIZE-1
    A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
  FOR n = k+1 .. SIZE-1
    A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
    FOR m = k+1 .. SIZE-1
      A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
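The loop nest above implicitly defines a DAG of task instances. A minimal Python sketch (illustrative only, not the DAGuE API) that enumerates those instances for a SIZE × SIZE tile grid, showing how the task count is dominated by the O(SIZE³) TSMQR updates:

```python
# Enumerate the task instances of the tiled QR loop nest above.
# Only the loop structure matters here; kernels and tiles are omitted.
def qr_tasks(SIZE):
    tasks = []
    for k in range(SIZE):
        tasks.append(("GEQRT", k))
        for m in range(k + 1, SIZE):
            tasks.append(("TSQRT", k, m))
        for n in range(k + 1, SIZE):
            tasks.append(("UNMQR", k, n))
            for m in range(k + 1, SIZE):
                tasks.append(("TSMQR", k, m, n))
    return tasks

# Count instances per task type for a 4x4 tile grid.
counts = {}
for t in qr_tasks(4):
    counts[t[0]] = counts.get(t[0], 0) + 1
print(counts)  # {'GEQRT': 4, 'TSQRT': 6, 'UNMQR': 6, 'TSMQR': 14}
```

The runtime never materializes this list in full; it is shown here only to make the size and shape of the DAG concrete.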

  14. Input Format – QUARK (PLASMA)
• Sequential C code annotated through QUARK-specific syntax
• Insert_Task
• INOUT, OUTPUT, INPUT
• REGION_L, REGION_U, REGION_D, …
• LOCALITY

for (k = 0; k < A.mt; k++) {
    Insert_Task( zgeqrt, A[k][k], INOUT,
                         T[k][k], OUTPUT );
    for (m = k+1; m < A.mt; m++) {
        Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                             A[m][k], INOUT | LOCALITY,
                             T[m][k], OUTPUT );
    }
    for (n = k+1; n < A.nt; n++) {
        Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                             T[k][k], INPUT,
                             A[k][n], INOUT );
        for (m = k+1; m < A.mt; m++) {
            Insert_Task( ztsmqr, A[k][n], INOUT,
                                 A[m][n], INOUT | LOCALITY,
                                 A[m][k], INPUT,
                                 T[m][k], INPUT );
        }
    }
}

  15. Dataflow Analysis
• Data-flow analysis of the serial code (example: task DGEQRT of QR)
• Polyhedral analysis through the Omega test
• Computes algebraic expressions for:
  • Source and destination tasks
  • Necessary conditions for each data flow to exist
[Diagram: for GEQRT(k), incoming data from memory (MEM) at k = 0 and outgoing data at k = SIZE-1; the UPPER part flows to TSQRT at m = k+1 and the LOWER part to UNMQR at n = k+1, overlaid on the QR loop nest of slide 13.]

  16. Intermediate Representation: Job Data Flow (JDF)

GEQRT(k)
  /* Execution space */
  k = 0 .. ( (MT < NT) ? MT-1 : NT-1 )
  /* Locality */
  : A(k, k)
  RW A    <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k)
          -> (k < NT-1) ? A UNMQR(k, k+1 .. NT-1)    [type = LOWER]
          -> (k < MT-1) ? A1 TSQRT(k, k+1)           [type = UPPER]
          -> (k == MT-1) ? A(k, k)                   [type = UPPER]
  WRITE T <- T(k, k)
          -> T(k, k)
          -> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1)
  /* Priority */
  ; (NT-k)*(NT-k)*(NT-k)
BODY
  zgeqrt( A, T )
END

• Control flow is eliminated, therefore maximum parallelism is possible.
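The point of this symbolic form is that successors can be evaluated on demand from k, MT, and NT alone, without building the DAG. A small Python sketch of evaluating GEQRT(k)'s outgoing A dependencies from the JDF above (the function name is illustrative, not part of DAGuE):

```python
# Evaluate the symbolic successor expressions of GEQRT(k)'s A output
# for an MT x NT tile grid, mirroring the three "->" lines of the JDF.
def geqrt_A_successors(k, MT, NT):
    succ = []
    if k < NT - 1:                       # -> A UNMQR(k, k+1 .. NT-1)  [LOWER]
        succ += [("UNMQR", k, n) for n in range(k + 1, NT)]
    if k < MT - 1:                       # -> A1 TSQRT(k, k+1)         [UPPER]
        succ.append(("TSQRT", k, k + 1))
    # k == MT-1: the UPPER part goes back to memory A(k,k); no task successor.
    return succ

print(geqrt_A_successors(0, 3, 3))
# [('UNMQR', 0, 1), ('UNMQR', 0, 2), ('TSQRT', 0, 1)]
```

Because each task can compute its own successors locally like this, every node discovers only the relevant portion of the DAG during execution, which is what makes the representation scalable.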

  17. Dataflow Representation
[Same toolchain diagram as slide 11, highlighting the JDF as the intermediate representation exchanged between the compilers and the DAGuE runtime.]

  18. Example: Reduction Operation
• Reduction: apply a user-defined operator to each datum and store the result in a single location (suppose the operator is associative and commutative).
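Because the operator is assumed associative and commutative, partial results can be combined pairwise in any order, exposing a log-depth tree of independent tasks. A minimal Python sketch of this idea (illustrative only, not DAGuE code):

```python
# Tree reduction: combine elements pairwise, level by level.
# Each op() call within a level is independent and could run as a task.
def tree_reduce(op, data):
    level = list(data)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))  # independent tasks
        if len(level) % 2:
            nxt.append(level[-1])                   # odd element carried up
        level = nxt
    return level[0]

print(tree_reduce(lambda a, b: a + b, [1, 2, 3, 4, 5]))  # 15
```

A sequential left-to-right reduction would serialize all N-1 operator applications; the tree shape is what a dataflow runtime can exploit.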
