Transparently Composing CnC Graph Pipelines on a Cluster
Hongbo Rong, Frank Schlimbach
Programming & Systems Lab (PSL), Software Systems Group (SSG)
7th Annual Concurrent Collections Workshop, 9/8/2015
Problem
A productivity program running on a cluster:
- The programmer is a domain expert, but not a tuning expert
- The program calls distributed libraries
- Library functions are not composable: black box, independent, context-unaware, barrier at the end
How can we compose these non-composable library functions automatically?
Flow Graphs
- Traditional: bulk-synchronous parallel
- Our goal: pipelined and asynchronous communication
(Figure from https://en.wikipedia.org/wiki/Bulk_synchronous_parallel)
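The contrast can be sketched in code. This is a minimal illustration (not the workshop's code): in the bulk-synchronous schedule, no second-stage work may begin until every first-stage chunk has finished, while the pipelined schedule lets chunk i flow through both stages as soon as it is ready. The stage functions and chunking are hypothetical stand-ins.

```cpp
#include <cassert>
#include <future>
#include <vector>

int stage1(int x) { return x + 1; }   // stands in for, e.g., C = A + B
int stage2(int c) { return c * 2; }   // stands in for, e.g., E = C * D

// Bulk-synchronous: run ALL of stage 1, hit an implicit barrier, then stage 2.
std::vector<int> run_bsp(const std::vector<int>& in) {
    std::vector<int> c(in.size()), e(in.size());
    for (size_t i = 0; i < in.size(); ++i) c[i] = stage1(in[i]);
    // <-- barrier: no stage-2 work may start before this point
    for (size_t i = 0; i < in.size(); ++i) e[i] = stage2(c[i]);
    return e;
}

// Pipelined & asynchronous: chunk i enters stage 2 the moment its stage-1
// result exists; there is no global barrier between the stages.
std::vector<int> run_pipelined(const std::vector<int>& in) {
    std::vector<std::future<int>> fut;
    for (int x : in)
        fut.push_back(std::async(std::launch::async,
                                 [x] { return stage2(stage1(x)); }));
    std::vector<int> e;
    for (auto& f : fut) e.push_back(f.get());
    return e;
}
```

Both schedules compute the same values; only the ordering constraints differ, which is where the latency hiding comes from.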
Basic Idea
User program: as usual, assume sequential, global shared-memory programming
    C = A + B
    E = C * D
Library: add a CnC graph for each library function (one graph reading A, B and producing C; another reading C, D and producing E)
Compiler/interpreter: compose the corresponding graphs of a sequence of library calls; both graphs use the identical memory for C
Execution: let CnC do the distribution:
    mpiexec -genv DIST_CNC=MPI -n 1000 ./julia user_script.jl
    mpiexec -genv DIST_CNC=MPI -n 1000 ./python user_script.py
    mpiexec -genv DIST_CNC=MPI -n 1000 ./matlab user_script.m
Hello World
User program:
    dgemm(A, B, C)  # C = A*B
    dgemm(C, D, E)  # E = C*D
Across, e.g., 100 processes: process i holds row block Ai* and runs Multiply Graph 1 (Ci* = Ai* x B), whose output feeds Multiply Graph 2 (Ei* = Ci* x D); the blocks Ei* together form E.
- Leverage the library
- No barrier/copy/message between graphs/steps unless required; e.g., no bcast/gather of C
Code skeleton

User code:
    dgemm(A, B, C)
    dgemm(C, D, E)

The compiler rewrites this into:

User code (host language):
    initialize_CnC()
    dgemm_dgemm(A, B, C, D, E)
    finalize_CnC()

Interface (C):
    void dgemm_dgemm(A, B, C, D, E) {
        dgemm_dgemm_context ctxt(A, B, C, D, E);
        ctxt.graph1->start();
        ctxt.graph2->start();
        ctxt.wait();
        ctxt.graph2->copyout();
    }

Context:
    struct dgemm_dgemm_context {
        item_collection *C_collection;
        tuner row_tuner, col_tuner;
        Graph *graph1, *graph2;
        dgemm_dgemm_context(A, B, C, D, E) {
            create C_collection
            graph1 = make_dgemm_graph(A, B, C);
            graph2 = make_dgemm_graph(C, D, E);
        }
    }

Domain-expert written:
    class dgemm_graph {
        tuner *tunerA, *tunerB, *tunerC, *tunerS;
        item_collection *A_collection, *B_collection, *C_collection;
        tag_collection tags;
        step_collection *multiply_steps;
        dgemm_graph(_A, _B, _C) {
            create A/B/C_collection based on A/B/C
            define dataflow graph
        }
    }
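The control flow of the generated interface can be mimicked in a self-contained sketch, with std::async standing in for CnC graph execution and trivial scalings standing in for the dgemm bodies (names and behavior here are hypothetical, not generated code): both graphs are started up front, graph 2 blocks only when it actually needs C, and a final wait/copyout returns E.

```cpp
#include <cassert>
#include <future>
#include <vector>

using Vec = std::vector<double>;

Vec scale(const Vec& v, double s) {
    Vec r;
    for (double x : v) r.push_back(x * s);
    return r;
}

struct dgemm_dgemm_context_sketch {
    std::shared_future<Vec> C;   // plays the role of the C item collection
    std::future<Vec> E;
    explicit dgemm_dgemm_context_sketch(const Vec& A) {
        // "graph1": C = A * B, faked here as C = 2*A
        C = std::async(std::launch::async,
                       [A] { return scale(A, 2.0); }).share();
        // "graph2": E = C * D, faked as E = 3*C; started immediately,
        // it blocks only at the point it consumes C
        auto c = C;
        E = std::async(std::launch::async,
                       [c] { return scale(c.get(), 3.0); });
    }
    // plays the role of ctxt.wait() + ctxt.graph2->copyout()
    Vec wait_and_copyout() { return E.get(); }
};
```

The point of the structure is that starting both graphs before waiting is what makes the barrier-free overlap possible; the context object just owns the shared intermediate.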
Key points
Compiler:
- Generates a context and an interface for a dataflow
- Connects expert-written graphs into a pipeline
- Minimizes communication with step tuners (static scheduling) and item-collection tuners (static data distribution)
Domain-expert written graphs:
- High-level algorithms for library functions
- Input/output collections can be from outside
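The tuner-based communication minimization reduces to a simple invariant, sketched below with hypothetical function names (not the CnC tuner API): one static owner function decides both where row i of C is produced (step tuner) and where it is consumed (item tuner). Because the two maps agree, the composed pipeline never has to move C.

```cpp
#include <cassert>

constexpr int NPROCS = 4;

// Static block-cyclic distribution of rows over processes.
int owner_of_row(int i) { return i % NPROCS; }

// Step-tuner role for graph 1: the rank on which C(i) is computed.
int compute_rank_graph1(int i) { return owner_of_row(i); }

// Item-tuner role for C: the rank on which graph 2 consumes C(i).
int consume_rank_graph2(int i) { return owner_of_row(i); }

// C(i) needs a message iff producer rank and consumer rank differ.
bool needs_comm(int i) {
    return compute_rank_graph1(i) != consume_rank_graph2(i);
}
```

With both tuners derived from the same static map, needs_comm is false for every row, which is exactly the "no bcast/gather of C" property from the Hello World slide.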
Advantages
- Useful for any language: relies on mature compiler/interpreter techniques (dataflow analysis, pattern matching, code replacement)
- Extends a scripting language to distributed computing implicitly: transparent to users, to the language, and to libraries
- Heavy lifting is done in CnC and in graph writing by domain experts
Open questions
- Minimizing communication: consumed_on for item collections, compute_on for step collections
- Scalability
- Applications: there might not be many long sequences of library calls