Automatic MPI Application Transformation with ASPhALT
Anthony Danalis, Lori Pollock, Martin Swany
University of Delaware
Outline: Motivation, Overview, Transformation, Automation, Evaluation, Future Work

Problem
Overall Research Goal

Requirements:
✔ Achieve high-performance communication ("have your cake")
✔ Simplify the MPI code developers write ("eat your cake, too")

Proposed solution: an automatic system that transforms simple communication code into efficient code (an "automatic cake making machine").

Side effect: legacy parallel MPI applications can scale, even if they were written without any knowledge of this system.
Overall Research Goal (continued)

ASPhALT: information from multiple cluster layers (Application, Runtime Libraries, Operating System/Network) contributes to source optimization.
Our Framework: ASPhALT

[Figure] The original application source code passes through a source-to-source optimizer (with a compiler analyzer) to produce optimized source code, which an existing compiler turns into the executable. The optimizer is informed by the low-level communication API, system benchmarks, and system parameters, drawn from the application, runtime-library, and operating system/network layers.
"Prepushing" Transformation
Communication Aggregation vs. Performance

Traditional approach: communication aggregation gives low overhead and high bandwidth, and therefore high performance.

Why does our communication segmentation still perform well? The extra overhead falls on the network, not the CPU. Fine-grain communication has high overhead and low bandwidth, but the transfers are overlapped with application computation (the CPU is not idle), so overall performance stays high.
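The trade-off on this slide can be made concrete with a toy cost model. This is not from the talk: all names and numbers are illustrative assumptions, chosen only to show how overlapped, segmented transfers can beat a single aggregated transfer even at lower effective bandwidth.

```python
# Toy cost model: aggregated vs. segmented, overlapped communication.
# All parameters are illustrative assumptions, not ASPhALT measurements.

def aggregated_time(compute, volume, bandwidth, overhead):
    """Compute everything, then send one large message (no overlap)."""
    return compute + overhead + volume / bandwidth

def segmented_time(compute, volume, bandwidth, overhead, nseg):
    """Send nseg small messages, each overlapped with the next compute
    chunk; only the final transfer is exposed."""
    chunk_compute = compute / nseg
    chunk_xfer = overhead + (volume / nseg) / bandwidth
    # Each transfer hides behind the following chunk's computation, so a
    # pipeline stage costs the larger of the two.
    stage = max(chunk_compute, chunk_xfer)
    return chunk_compute + (nseg - 1) * stage + chunk_xfer

t_agg = aggregated_time(compute=10.0, volume=8.0, bandwidth=1.0, overhead=0.1)
t_seg = segmented_time(compute=10.0, volume=8.0, bandwidth=0.8, overhead=0.1, nseg=8)
print(t_agg, t_seg)  # segmented wins despite 20% lower bandwidth
```

Even with higher per-message overhead and lower bandwidth, the segmented version finishes sooner because the CPU never sits idle waiting for the network.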
Transformer Prototype
Fortran Semantics vs. MPI Semantics

Before ASPhALT:

    sArray[ NX, NY ]
    DO I = 1, N
      kernel( sArray[ :, I ], ... )
    END DO
    synchrnsTransfer( sArray[ :, : ], rArray[ :, : ] )

After ASPhALT:

    sArray[ NX, NY ], rArray[ NX, NY ]
    DO T = 1, N, K
      DO P = 1, NPROC
        S = F( NX, P, NPROC )
        E = G( NX, P, NPROC )
        asynchRecvInit( rArray[ S:E, T:T+K-1 ], req[ T/K ] )
      END DO
      DO I = T, T+K-1
        kernel( sArray[ :, I ], ... )
      END DO
      DO P = 1, NPROC
        S = F( NX, P, NPROC )
        E = G( NX, P, NPROC )
        asynchSendInit( sArray[ S:E, T:T+K-1 ] )
      END DO
      IF ( T/K > D ) THEN
        wait( request[ T/K - D ] )
      END IF
    END DO
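The control flow of the transformed loop can be simulated without MPI. The sketch below is a pure-Python stand-in: the communication calls are replaced by event logging, and N, K, and D are illustrative values. It shows how at most D tiles of requests stay in flight before the loop waits on the oldest one.

```python
# Sketch of the transformed loop's control flow: per tile of K iterations,
# post receives, compute, post sends, then wait on the request issued D
# tiles earlier. MPI calls are replaced by log entries; values illustrative.

N, K, D = 16, 4, 2
events = []

for t in range(1, N + 1, K):
    tile = (t - 1) // K + 1              # tile index (the slides' T/K), 1-based
    events.append(f"recv_init tile {tile}")
    for i in range(t, t + K):
        events.append(f"compute col {i}")
    events.append(f"send_init tile {tile}")
    if tile > D:
        events.append(f"wait tile {tile - D}")  # keep only D tiles in flight

# After the loop, drain the requests still outstanding.
ntiles = N // K
for tile in range(ntiles - D + 1, ntiles + 1):
    events.append(f"wait tile {tile}")

print(events)
```

Note that the wait for tile 1 happens only after tile 3's sends are posted, which is exactly the D-deep pipeline that keeps transfers overlapped with computation.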
After the Fortran compiler (an array slice implies an implicit copy so that the data is contiguous):

    TEMP1[ : ] = rArray[ S:E, T:T+K-1 ]
    asynchRecvInit( TEMP1[ : ] )
    rArray[ S:E, T:T+K-1 ] = TEMP1[ : ]

    TEMP2[ : ] = sArray[ S:E, T:T+K-1 ]
    asynchSend( TEMP2[ : ] )
    sArray[ S:E, T:T+K-1 ] = TEMP2[ : ]
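Why the slice forces a copy: in a column-major (Fortran) array, the elements of a sub-block such as rArray[ S:E, ... ] are scattered through memory, so the compiler packs them into a contiguous temporary before the call. A pure-Python illustration (not ASPhALT code) with a flat column-major buffer:

```python
# Column-major storage: element (i, j) of an NX-by-NY array lives at
# linear index i + j*NX. A row sub-block's elements are therefore
# non-contiguous, which is what forces the compiler's TEMP copy.

NX, NY = 4, 3

def flat_index(i, j):
    """Column-major (Fortran) linear index of element (i, j), 0-based."""
    return i + j * NX

# Linear indices of the slice rows S..E-1 across all columns:
S, E = 1, 3
block = [flat_index(i, j) for j in range(NY) for i in range(S, E)]
print(block)  # gaps between columns: the slice is not contiguous

contiguous = all(b - a == 1 for a, b in zip(block, block[1:]))
print(contiguous)

# Packing into a temporary makes the data contiguous for the transfer:
storage = list(range(NX * NY))
temp = [storage[k] for k in block]
```

The `temp` list plays the role of TEMP1/TEMP2 in the compiled code: a dense buffer the communication library can transfer in one piece.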
Potential problems:

Receive side: TEMP1 is copied back into rArray, but the data has not arrived yet.

Send side: data-flow analysis allows the F90 compiler to redefine TEMP2 right after this copy, but the data has not departed yet.
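The receive-side hazard can be demonstrated with a small deterministic simulation. This is a hedged sketch, not real MPI: the asynchronous transfer is a thread gated on an event, and all names are illustrative. If the compiler's copy-back runs before the matching wait, it copies stale data.

```python
# Receive-side hazard sketch: copying the temporary back into the
# destination before waiting on the request reads stale data. The
# asynchronous "transfer" is emulated with a gated thread.
import threading

def receive_then_copy_back(wait_first):
    temp = [0, 0, 0]                 # compiler temporary TEMP1 (stale contents)
    arrived = threading.Event()

    def transfer():                  # stands in for the asynchronous receive
        arrived.wait()               # data lands only when the "network" fires
        temp[:] = [7, 8, 9]

    th = threading.Thread(target=transfer)
    th.start()                       # asynchRecvInit: request in flight
    if wait_first:
        arrived.set()                # network delivers the data...
        th.join()                    # ...and wait(request) completes first
    r_slice = temp[:]                # the copy-back: rArray[...] = TEMP1[:]
    if not wait_first:
        arrived.set()
        th.join()                    # waiting after the copy-back is too late
    return r_slice

print(receive_then_copy_back(True))   # correct: data arrived before the copy
print(receive_then_copy_back(False))  # stale: data had not arrived yet
```

The same ordering argument applies on the send side: the temporary must stay untouched until the matching wait confirms the data has departed.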