Do we need dataflow programming? — Anthony Danalis — PowerPoint PPT Presentation


  1. Do we need dataflow programming?
     Anthony Danalis, Innovative Computing Laboratory, University of Tennessee
     CCDSC'16, Chateau des Contes

  2. Programming vs Execution
     • Dataflow-based execution
       – Think ILP, out-of-order execution
       – Automatically derived by hardware/compiler/etc.
     • Dataflow programming
       – Think workflows
       – Flow of data explicitly specified by a human

  3. Task-based vs Dataflow-based
     Is task execution the same thing?
     OpenMP, StarPU, PaRSEC, *SS, HPX

  4. Task-based vs Dataflow-based
     Is task execution the same thing? The runtimes above split into two camps: either the
     developer specifies the dataflow, or the runtime derives it (OpenMP, StarPU, PaRSEC, *SS, HPX).

  5. Limits of deriving the dataflow
     P: nodes; N: number of kernel executions; Tk: kernel execution time; To: overhead of discovery
     To*N << Tk*N/P  =>  To*N <= 0.1*Tk*N/P  =>  P <= 0.1*Tk/To
     Example: To = 100ns, Tk = 100us  =>  P <= 100
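The slide's back-of-the-envelope bound can be sketched directly. This is a minimal illustration of the arithmetic above, with times in nanoseconds; the function name is mine, not from any runtime:

```python
def max_useful_nodes(t_kernel_ns, t_overhead_ns, slack=0.1):
    """Largest node count P for which sequential task discovery stays under
    slack * per-node work: To*N <= slack * Tk*N/P  =>  P <= slack*Tk/To."""
    return int(slack * t_kernel_ns / t_overhead_ns)

# Numbers from the slide: To = 100 ns discovery overhead, Tk = 100 us kernels.
print(max_useful_nodes(t_kernel_ns=100_000, t_overhead_ns=100))  # -> 100
```

Past that point the single thread doing discovery, not the kernels, dictates the execution rate.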

  6. Explicit Dataflow Programming
     Why does Explicit Dataflow Programming (EDP) differ from everything else?
     The human developer explicitly expresses the semantics of the algorithm/application in a way
     that the runtimes/compilers can directly take advantage of without deriving information.

  7. Explicit Dataflow Programming
     Why does Explicit Dataflow Programming (EDP) differ from everything else?
     The human developer explicitly expresses the semantics of the algorithm/application in a way
     that the runtimes/compilers can directly take advantage of without deriving information.
     Benefits: perfect parallelism, automatic comm./comp. overlap, collective-operation detection.

  8. Perf. Case study: NWChem CCSD

     DO {x4}
       CALL nxt_ctx_next(ctx, icounter, next)        ! Global work stealing
       IF ( (int_mb(…)+...).ne.8 ) THEN
         CALL MA_PUSH_GET(…)                         ! Allocate and initialize C
         CALL DFILL(…)
         DO {x2}
           IF ( (int_mb(…)+… .eq. int_mb(…) ) THEN
             CALL MA_PUSH_GET(…,k_a)                 ! Allocate and fetch A (same for B, not shown)
             CALL GET_HASH_BLOCK(d_a, dbl_mb(k_a), …)
             CALL DGEMM(…)                           ! Actual work
           END IF
         END DO
         CALL TCE_SORT_4(dbl_mb(k_c), …)
         CALL ADD_HASH_BLOCK(d_c, dbl_mb(k_c), …)    ! Push C back
     END DO

  9. Structure of PTG computation

  10. CCSD Execution Time on 32 nodes
      [Bar chart: execution time (sec) vs. cores/node (1-16) on 32 nodes, comparing the Original
      code against PaRSEC; labeled data points include 1703 s (Original) and 818 s (PaRSEC).]

  11. Performance bottlenecks

      DO {x4}
        CALL nxt_ctx_next(ctx, icounter, next)       ! 1. Global atomic
        IF ( (int_mb(…)+...).ne.8 ) THEN             ! 2. Coarse-grain parallelism
          CALL MA_PUSH_GET(…)
          CALL DFILL(…)
          DO {x2}
            IF ( (int_mb(…)+… .eq. int_mb(…) ) THEN
              CALL MA_PUSH_GET(…,k_a)                ! 3. No opportunity for comm/comp overlap
              CALL GET_HASH_BLOCK(d_a, dbl_mb(k_a), …)
              CALL DGEMM(…)
            END IF
          END DO
          CALL TCE_SORT_4(dbl_mb(k_c), …)
          CALL ADD_HASH_BLOCK(d_c, dbl_mb(k_c), …)
      END DO
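Bottleneck 1 is worth spelling out: nxt_ctx_next() acts as one shared counter that every process must hit to claim its next chunk of work, so all workers serialize on it. A hedged sketch of that pattern (class and method names are illustrative, not NWChem's code):

```python
import itertools
import threading

class GlobalCounter:
    """Shared work-stealing counter: each call atomically claims the next
    task index. The single lock is the serialization point the slide blames."""
    def __init__(self):
        self._it = itertools.count()
        self._lock = threading.Lock()

    def next(self):
        with self._lock:          # every worker, on every task, contends here
            return next(self._it)

ctr = GlobalCounter()
claimed = [ctr.next() for _ in range(4)]
print(claimed)  # -> [0, 1, 2, 3]
```

At scale, the round-trip to this one counter grows with the number of processes, independent of how much useful work each claimed chunk contains.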

  12. Trace of Original code

  13. Trace of Original code (zoom)

  14. Trace of PaRSEC implementation

  15. Whose fault is the bad performance?
      Audience participation. Choose who to blame:
      • MPI
      • Developers
      • Programming paradigm (Coarse Grain Parallelism)
      • Vetter (for not telling his users about dataflow)

  16. Whose fault is it?
      Audience participation. Choose who to blame:
      • MPI
      • Developers
      • Programming paradigm (Coarse Grain Parallelism)
      • Vetter (for not telling his users about dataflow)
      "MPI has a simple and an advanced API, and many developers use only the simple one." – Rusty

  17. Message so far
      • Using CGP does not scale
      • Using dataflow execution does
      • BUT, developers have to understand their code

  18. Sure, but can we make EDP easy?
      Can we make dataflow execution harness all the benefits without explicit dataflow programming?

  19. Sure, but can we make EDP easy?
      Can we make dataflow execution harness all the benefits without explicit dataflow programming?
      Yes, we can. In some cases. Maybe?

  20. Bridging Explicit & Implicit dataflow
      • Reduce the cost of discovery
        – Code specialization: developer expertise; results of compiler analysis
      • Harness benefits of parametric representation
        – Compress the graph on the fly: detect patterns in series that translate to
          expressions, or functions; use compiler-inserted hints
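"Detect patterns in series that translate to expressions" can be sketched concretely: if the tasks a runtime discovers one by one follow a regular progression, the explicit list can be replaced by a closed-form description. A minimal illustration, assuming the simplest pattern (an arithmetic progression of task indices); the function and tuple format are mine:

```python
def compress_series(indices):
    """Return ('affine', start, step, count) if the discovered task indices
    follow i(n) = start + n*step, otherwise keep the explicit list."""
    if len(indices) >= 3:
        step = indices[1] - indices[0]
        if all(b - a == step for a, b in zip(indices, indices[1:])):
            return ('affine', indices[0], step, len(indices))
    return ('explicit', list(indices))

# A discovered chain of tasks 0 -> 2 -> 4 -> ... compresses to one tuple.
print(compress_series([0, 2, 4, 6, 8]))   # -> ('affine', 0, 2, 5)
print(compress_series([0, 1, 4, 9]))      # no pattern: stays explicit
```

Once compressed, the runtime can answer "what is the next task?" by evaluating the expression instead of walking a stored graph, which is exactly the benefit the parametric representation offers.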

  21. Reduce the unnecessary discovery

      DO {x4}
        CALL nxt_ctx_next(ctx, icounter, next)
        IF ( (int_mb(…)+...).ne.8 ) THEN
          CALL MA_PUSH_GET(…)
          CALL DFILL(…)
          DO {x2}
            IF ( (int_mb(…)+… .eq. int_mb(…) ) THEN
              CALL MA_PUSH_GET(…,k_a)
              CALL GET_HASH_BLOCK(d_a, dbl_mb(k_a), …)
              CALL DGEMM(…)                          ! Insert_Task
            END IF
          END DO
          CALL TCE_SORT_4(dbl_mb(k_c), …)
          CALL ADD_HASH_BLOCK(d_c, dbl_mb(k_c), …)
      END DO

  22. Reduce the unnecessary discovery

      DO {x4}
        CALL nxt_ctx_next(ctx, icounter, next)
        IF ( (int_mb(…)+...).ne.8 ) THEN
          CALL MA_PUSH_GET(…)                        ! Handle Generation
          CALL DFILL(…)
          DO {x2}
            IF ( (int_mb(…)+… .eq. int_mb(…) ) THEN
              CALL MA_PUSH_GET(…,k_a)                ! Handle Generation
              CALL GET_HASH_BLOCK(d_a, dbl_mb(k_a), …)   ! Data Fetching
              CALL DGEMM(…)                          ! Insert_Task
            END IF
          END DO
          CALL TCE_SORT_4(dbl_mb(k_c), …)
          CALL ADD_HASH_BLOCK(d_c, dbl_mb(k_c), …)   ! Data Flushing
      END DO

  23. Reduce the unnecessary discovery

      DO {x4}                                        ! Pleasantly Parallel
        CALL nxt_ctx_next(ctx, icounter, next)
        IF ( (int_mb(…)+...).ne.8 ) THEN
          CALL MA_PUSH_GET(…)                        ! Handle Generation
          CALL DFILL(…)
          DO {x2}
            IF ( (int_mb(…)+… .eq. int_mb(…) ) THEN
              CALL MA_PUSH_GET(…,k_a)                ! Handle Generation
              CALL GET_HASH_BLOCK(d_a, dbl_mb(k_a), …)   ! Data Fetching
              CALL DGEMM(…)                          ! Insert_Task
            END IF
          END DO
          CALL TCE_SORT_4(dbl_mb(k_c), …)
          CALL ADD_HASH_BLOCK(d_c, dbl_mb(k_c), …)   ! Data Flushing
      END DO

  24. Dataflow between subroutines

  25. Code grouping based on dataflow

  26. Message so far
      • Discovering the whole DAG does not scale
      • Pruning the DAG requires human expertise
      • Compiler analysis can assist with pruning
      • BUT, developers have to understand their code

  27. Compressing the DAG to a PTG?

      for (k = 0; k < MT; k++) {
          Insert_Task( zgeqrt, A[k][k], INOUT,
                       T[k][k], OUTPUT);
          for (m = k+1; m < MT; m++) {
              Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                           A[m][k], INOUT | LOCALITY,
                           T[m][k], OUTPUT);
          }
          for (n = k+1; n < NT; n++) {
              Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                           T[k][k], INPUT,
                           A[k][n], INOUT);
              for (m = k+1; m < MT; m++) {
                  Insert_Task( ztsmqr, A[k][n], INOUT,
                               A[m][n], INOUT | LOCALITY,
                               A[m][k], INPUT,
                               T[m][k], INPUT);
              }
          }
      }
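The INPUT/INOUT annotations are what lets an Insert_Task-style runtime derive the DAG: each data handle remembers its last writer, and any later task touching that handle gains an edge from it. A hedged toy model of that mechanism (this is not the PaRSEC or StarPU API, and it tracks only last-writer dependences, ignoring WAR edges between readers and a later writer):

```python
INPUT, INOUT = 'R', 'RW'

def build_dag(task_list):
    """task_list: [(name, [(data_key, access)])] -> set of (pred, succ) edges."""
    last_writer = {}
    edges = set()
    for name, accesses in task_list:
        for key, access in accesses:
            if key in last_writer and last_writer[key] != name:
                edges.add((last_writer[key], name))  # dependence carried by key
            if access == INOUT:
                last_writer[key] = name              # this task is the new writer
    return edges

# The first panel (k = 0) of the QR loop nest above, for a 2x2 tile matrix:
tasks = [
    ('geqrt(0)',     [('A00', INOUT)]),
    ('tsqrt(0,1)',   [('A00', INOUT), ('A10', INOUT)]),
    ('unmqr(0,1)',   [('A00', INPUT), ('A01', INOUT)]),
    ('tsmqr(0,1,1)', [('A01', INOUT), ('A11', INOUT), ('A10', INPUT)]),
]
print(sorted(build_dag(tasks)))
```

This also makes the scaling problem visible: the runtime must replay the entire loop nest and touch every handle sequentially to recover edges the developer already knew when writing the loops.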

  28. What does a DAG look like?
      [Figure: an unrolled Cholesky task DAG — POTRF, TRSM, SYRK, and GEMM nodes connected by
      dataflow edges labeled with the data they carry (T=>T, C=>A, C=>B, C=>C).]

  29. Fully compressed DAG (PTG)
      [Figure: the four QR task classes as a parameterized task graph — GEQRT(k), TSQRT(k,m),
      UNMQR(k,n), and TSMQR(k,m,n), each with parameter ranges such as k = 0..mt-1, m = k+1..mt-1,
      n = k+1..nt-1 — connected by symbolic edges such as {[k,m,n]->[k,m+1,n] : m<mt-1},
      {[k,m,n]->[k+1,m,n] : n>k+1 && m>k+1}, {[k]->[k,n] : k<n<nt && k<nt-1}, and
      {[k,m,n]->[n] : k+1==n && k+1==m}.]
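What the PTG buys is that successors are computed from closed-form rules at runtime, so no graph is ever stored or unrolled. A hedged sketch evaluating two of the TSMQR edge rules from the figure (the encoding as a Python function is mine; mt is the number of tile rows):

```python
def tsmqr_successors(k, m, n, mt):
    """Successors of TSMQR(k,m,n), evaluated in O(1) from symbolic edge rules."""
    succ = []
    if m < mt - 1:                     # {[k,m,n] -> [k,m+1,n] : m < mt-1}
        succ.append(('TSMQR', k, m + 1, n))
    if n > k + 1 and m > k + 1:        # {[k,m,n] -> [k+1,m,n] : n>k+1 && m>k+1}
        succ.append(('TSMQR', k + 1, m, n))
    return succ

# Any task's outgoing edges are available without materializing the DAG:
print(tsmqr_successors(0, 1, 2, mt=4))
```

A runtime holding only these rules (plus the parameter ranges) can start execution anywhere in the graph, which is what makes the representation independent of problem size.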

  30. Compressing the DAG to a PTG?
      The slide's margin labels the sequential stream of tasks the loop nest discovers
      (Task_A, Task_B, Task_B, Task_B, Task_C, Task_D, Task_D, Task_D, …) alongside the code:

      for (k = 0; k < MT; k++) {
          Insert_Task( zgeqrt, A[k][k], INOUT,
                       T[k][k], OUTPUT);
          for (m = k+1; m < MT; m++) {
              Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                           A[m][k], INOUT | LOCALITY,
                           T[m][k], OUTPUT);
          }
          for (n = k+1; n < NT; n++) {
              Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                           T[k][k], INPUT,
                           A[k][n], INOUT);
              for (m = k+1; m < MT; m++) {
                  Insert_Task( ztsmqr, A[k][n], INOUT,
                               A[m][n], INOUT | LOCALITY,
                               A[m][k], INPUT,
                               T[m][k], INPUT);
              }
          }
      }
