

  1. Transparently Composing CnC Graph Pipelines on a Cluster
     Hongbo Rong, Frank Schlimbach
     Programming & Systems Lab (PSL), Software Systems Group (SSG)
     7th Annual Concurrent Collections Workshop, 9/8/2015

  2-6. Problem
     A productivity program running on a cluster:
     - The programmer is a domain expert, but not a tuning expert
     - The program calls distributed libraries
     - Library functions are not composable: each is a black box, independent,
       context-unaware, and ends with a barrier
     How can these non-composable library functions be composed automatically?

  7-10. Flow Graphs
     Traditional: bulk-synchronous parallel (compute, communicate, barrier, repeat)
     Pipelined & asynchronous: communication overlaps computation
     [Figure from https://en.wikipedia.org/wiki/Bulk_synchronous_parallel]

  11-14. Basic Idea
     User program: as usual, assume sequential, global shared-memory programming
         C = A + B
         E = C * D
     Library: add a CnC graph for each library function
         [figure: one dataflow graph per call, producing C from A and B,
          and E from C and D]
     Compiler/interpreter: compose the corresponding graphs of a sequence of
     library calls; both graphs use the identical memory for C
     Execution: let CnC do the distribution:
         mpiexec -genv DIST_CNC=MPI -n 1000 ./julia user_script.jl
         mpiexec -genv DIST_CNC=MPI -n 1000 ./python user_script.py
         mpiexec -genv DIST_CNC=MPI -n 1000 ./matlab user_script.m
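To make the one-graph-per-library-function idea concrete, here is a minimal sketch of what an expert-written elementwise-add graph could look like in Intel CnC's C++ API. The names (add_step, add_ctx, per-element integer tags) are illustrative assumptions, not code from the talk:

    #include <cnc/cnc.h>

    struct add_ctx;

    // One step instance per element index i: C[i] = A[i] + B[i].
    struct add_step {
        int execute( const int & i, add_ctx & c ) const;
    };

    struct add_ctx : public CnC::context< add_ctx > {
        CnC::step_collection< add_step >    adds;
        CnC::tag_collection< int >          indices;
        CnC::item_collection< int, double > A, B, C;

        add_ctx() : adds( *this ), indices( *this ),
                    A( *this ), B( *this ), C( *this ) {
            indices.prescribes( adds, *this ); // tag i launches adds(i)
            adds.consumes( A );
            adds.consumes( B );
            adds.produces( C );
        }
    };

    int add_step::execute( const int & i, add_ctx & c ) const {
        double a, b;
        c.A.get( i, a );     // re-queued automatically if A[i] is not here yet
        c.B.get( i, b );
        c.C.put( i, a + b ); // a downstream graph can consume C[i] immediately
        return CnC::CNC_Success;
    }

The point of exposing A, B, and C as item collections is that composition is just collection sharing: a second graph that consumes C needs no barrier, only get() calls on the elements it actually reads.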

  15-22. Hello World
     User program:
         dgemm(A, B, C)   # C = A*B
         dgemm(C, D, E)   # E = C*D
     Leverage the library: each call becomes an instance of an expert-written
     Multiply graph.
     [figure: processes 1..100; on process p, Multiply graph 1 computes
      row block C(p,*) = A(p,*) * B, and Multiply graph 2 computes
      E(p,*) = C(p,*) * D]
     No barrier/copy/message between graphs/steps unless required;
     e.g. no broadcast/gather of C between the two calls.
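Expressed with plain CnC collections, the two Multiply graphs can live in one generated context and share the C collection, so a stage-2 step waits only for the individual C blocks it reads. Below is a deliberately simplified, shared-memory sketch (row blocks collapsed to a single double, B and D to scalars; all names hypothetical):

    #include <cnc/cnc.h>

    struct pipe_ctx;

    // Stage 1: C[p] = A[p] * B.  Stage 2: E[p] = C[p] * D.
    // One step instance per row block p.
    struct mul1 { int execute( const int & p, pipe_ctx & c ) const; };
    struct mul2 { int execute( const int & p, pipe_ctx & c ) const; };

    struct pipe_ctx : public CnC::context< pipe_ctx > {
        CnC::step_collection< mul1 >        stage1;
        CnC::step_collection< mul2 >        stage2;
        CnC::tag_collection< int >          rows;
        CnC::item_collection< int, double > A, C, E; // C shared by both stages
        double B, D;  // scalars here; real tiles would be item collections too

        pipe_ctx() : stage1( *this ), stage2( *this ), rows( *this ),
                     A( *this ), C( *this ), E( *this ), B( 2.0 ), D( 3.0 ) {
            rows.prescribes( stage1, *this );  // tag p launches both stages;
            rows.prescribes( stage2, *this );  // stage 2 blocks on C[p] only
            stage1.consumes( A );  stage1.produces( C );
            stage2.consumes( C );  stage2.produces( E );
        }
    };

    int mul1::execute( const int & p, pipe_ctx & c ) const {
        double a;  c.A.get( p, a );
        c.C.put( p, a * c.B );         // row block p of C is ready
        return CnC::CNC_Success;
    }

    int mul2::execute( const int & p, pipe_ctx & c ) const {
        double cp;  c.C.get( p, cp );  // waits for row p only, not all of C
        c.E.put( p, cp * c.D );
        return CnC::CNC_Success;
    }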

  23-29. Code skeleton
     User code (host language):
         dgemm(A, B, C)
         dgemm(C, D, E)
     The compiler rewrites it into:
         initialize_CnC()
         dgemm_dgemm(A, B, C, D, E)
         finalize_CnC()
     Compiler-generated context and interface (C):
         struct dgemm_dgemm_context {
             item_collection *C_collection;
             tuner row_tuner, col_tuner;
             Graph *graph1, *graph2;
             dgemm_dgemm_context(A, B, C, D, E) {
                 create C_collection
                 graph1 = make_dgemm_graph(A, B, C);
                 graph2 = make_dgemm_graph(C, D, E);
             }
         };

         void dgemm_dgemm(A, B, C, D, E) {
             dgemm_dgemm_context ctxt(A, B, C, D, E);
             ctxt.graph1->start();
             ctxt.graph2->start();
             ctxt.wait();
             ctxt.graph2->copyout();
         }
     Domain-expert-written graph:
         class dgemm_graph {
             tuner *tunerA, *tunerB, *tunerC, *tunerS;
             item_collection *A_collection, *B_collection, *C_collection;
             tag_collection tags;
             step_collection *multiply_steps;
             dgemm_graph(_A, _B, _C) {
                 create A/B/C_collection based on A/B/C
                 define dataflow graph
             }
         };
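On a real CnC installation, initialize_CnC() and finalize_CnC() have a natural counterpart in Intel distCnC's dist_cnc_init object, whose constructor boots the distributed runtime and whose destructor tears it down. A hedged sketch of a host-side driver, reusing the add_ctx sketch from earlier (the header name is hypothetical):

    #include <cnc/dist_cnc.h>  // shared-memory-only builds use <cnc/cnc.h>
    #include <iostream>
    #include "add_ctx.h"       // hypothetical header holding the add_ctx sketch

    int main() {
        // Must be created before any context, and must list every context
        // type used. Reads DIST_CNC from the environment (e.g. DIST_CNC=MPI
        // under mpiexec); its destructor plays the role of finalize_CnC().
        CnC::dist_cnc_init< add_ctx > init;

        add_ctx ctx;
        for( int i = 0; i < 1000; ++i ) {
            ctx.A.put( i, 1.0 * i );
            ctx.B.put( i, 2.0 * i );
            ctx.indices.put( i );   // launch adds(i)
        }
        ctx.wait();                 // quiescence of this context

        double c0;
        ctx.C.get( 0, c0 );         // copy out only what the host needs
        std::cout << "C[0] = " << c0 << std::endl;
        return 0;
    }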

  30-31. Key points
     Compiler
     - Generates a context and an interface for a dataflow
     - Connects expert-written graphs into a pipeline
     - Minimizes communication with step tuners (static scheduling) and
       item-collection tuners (static data distribution)
     Domain-expert-written graphs
     - High-level algorithms for library functions
     - Input/output collections can come from outside

  32. Advantages
     Useful for any language
     - Builds on mature compiler/interpreter work: dataflow analysis,
       pattern matching, code replacement
     Extends a scripting language to distributed computing implicitly
     - Transparent to users
     - Transparent to the language
     - Transparent to libraries
     Heavy lifting is done in CnC and in graph writing by domain experts

  33. Open questions
     Minimizing communication
     - Item collections: consumed_on
     - Step collections: compute_on
     Scalability
     Applications
     - There might not be many long sequences of library calls
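Both hooks exist as tuner methods in Intel CnC. The sketch below shows a static row-block placement; the modulo distribution and the tuner names are illustrative choices, not from the talk:

    #include <cnc/cnc.h>

    // Step tuner (static scheduling): row block p is computed on process
    // p % numProcs(), so both dgemm stages for a given row run on the
    // same process and the C block never crosses the network.
    struct row_step_tuner : public CnC::step_tuner<> {
        template< typename Ctx >
        int compute_on( const int & p, Ctx & ) const {
            return p % numProcs();
        }
    };

    // Item tuner (static data distribution): declaring the consumer up
    // front lets the runtime push C[p] to its one consumer instead of
    // fetching it on demand.
    struct row_item_tuner : public CnC::hashmap_tuner {
        int consumed_on( const int & p ) const {
            return p % numProcs();
        }
    };

A graph would attach these as the extra template argument of its collections, e.g. CnC::step_collection< mul1, row_step_tuner > and CnC::item_collection< int, double, row_item_tuner >.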
