Automatic MPI application transformation with ASPhALT


  1. Automatic MPI application transformation with ASPhALT. Anthony Danalis, Lori Pollock, Martin Swany, University of Delaware.

  2. Problem (talk outline: Motivation, Overview, Transformation, Automation, Evaluation, Future Work).

  3. Overall Research Goal. Requirements: ✔ achieve high-performance communication (have your cake); ✔ simplify the MPI code developers write (and eat it too).

  4. Overall Research Goal (cont.). The automatic cake-making machine.

  5. Overall Research Goal (cont.). Proposed solution: an automatic system that transforms simple communication code into efficient code.

  6. Overall Research Goal (cont.). Side effect: enables legacy parallel MPI applications to scale, even if they were written without any knowledge of this system.

  7. Overall Research Goal. Cluster layers: Application, Runtime Libraries, Operating System / Network. ASPhALT: information from multiple layers contributes to source optimization.

  8. Our Framework: ASPhALT. The original application source code passes through a source-to-source optimizer/analyzer, which produces optimized source code; an existing compiler then builds the executable. The analyzer draws on the low-level communication API, system benchmarks, and system parameters, i.e. information from the application, runtime-library, and operating system / network layers.

  9.-14. "Prepushing" Transformation (a step-by-step animation of the transformation).
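  Slides 9-14 build up the "prepushing" idea: start sending each block of results as soon as it is computed, instead of issuing one blocking transfer after the whole loop (this is the pattern the code on slide 18 makes explicit). A minimal Fortran 90 + MPI sketch of that pattern follows; the array sizes, the block size K, and the two-rank producer/consumer layout are assumptions made for this illustration, not details taken from the slides.

    program prepush_sketch
      use mpi
      implicit none
      integer, parameter :: NX = 64, N = 1024, K = 128      ! assumed problem and block sizes
      real(8)            :: a(NX, N)
      integer            :: reqs(N/K), ierr, rank, t, i
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      reqs = MPI_REQUEST_NULL
      if (rank == 0) then                                   ! producer (run with at least 2 ranks)
         do t = 1, N, K
            do i = t, t + K - 1
               a(:, i) = real(i, 8)                         ! stands in for kernel( a(:,i), ... )
            end do
            ! prepush: start sending this block while later blocks are still being computed
            call MPI_Isend(a(1, t), NX*K, MPI_DOUBLE_PRECISION, 1, t, &
                           MPI_COMM_WORLD, reqs((t - 1)/K + 1), ierr)
         end do
      else if (rank == 1) then                              ! consumer: pre-post matching receives
         do t = 1, N, K
            call MPI_Irecv(a(1, t), NX*K, MPI_DOUBLE_PRECISION, 0, t, &
                           MPI_COMM_WORLD, reqs((t - 1)/K + 1), ierr)
         end do
      end if
      call MPI_Waitall(N/K, reqs, MPI_STATUSES_IGNORE, ierr)
      call MPI_Finalize(ierr)
    end program prepush_sketch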

  15. Communication Aggregation vs. Performance. Traditional approach: low overhead + communication aggregation + high bandwidth = high performance.

  16. Communication Aggregation vs. Performance (cont.). Why does our communication segmentation work? It has high overhead, but the overhead falls on the network, not the CPU; the fine-grain communication has lower effective bandwidth, but the transfers are overlapped with the application (i.e. the CPU is not idle), so performance remains high.
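  As a back-of-the-envelope illustration of this trade-off (the numbers are invented for exposition, not measurements from the paper): suppose a phase computes for 80 ms and then needs a single aggregated transfer that takes 20 ms, so the blocking version costs about 100 ms. If the same data is instead pushed in ten segments whose combined transfer time grows to 30 ms because of per-message overhead, but every segment except the last is hidden behind the remaining computation, the visible cost is roughly 80 ms plus the 3 ms tail of the final segment, about 83 ms, despite the higher overhead and lower effective bandwidth.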

  17. Transformer Prototype.

  18. Fortran Semantics vs. MPI Semantics

  Before ASPhALT:

    sArray[ NX, NY ]
    DO I = 1, N
       kernel( sArray[ :, I ], ... )
    END DO
    synchrnsTransfer( sArray[ :, : ], rArray[ :, : ] )

  After ASPhALT:

    sArray[ NX, NY ], rArray[ NX, NY ]
    DO T = 1, N, K
       DO P = 1, NPROC
          S = F( NX, P, NPROC )
          E = G( NX, P, NPROC )
          asynchRecvInit( rArray[ S:E, T:T+K-1 ], req[ T/K ] )
       END DO
       DO I = T, T+K-1
          kernel( sArray[ :, I ], ... )
       END DO
       DO P = 1, NPROC
          S = F( NX, P, NPROC )
          E = G( NX, P, NPROC )
          asynchSendInit( sArray[ S:E, T:T+K-1 ] )
       END DO
       IF( T/K > D ) THEN
          wait( request[ T/K - D ] )
       END IF
    END DO
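  The calls asynchRecvInit, asynchSendInit and wait above stand for non-blocking communication initiation and completion; in MPI terms they map naturally onto MPI_Irecv, MPI_Isend and MPI_Wait. As a sketch only (this wrapper and its interface are assumed for illustration and are not the ASPhALT runtime API), a send-initiation call could expand to:

    ! Hypothetical wrapper: what a call like asynchSendInit( sArray[S:E, T:T+K-1] )
    ! could expand to in plain MPI. The interface is assumed for illustration only.
    subroutine asynch_send_init(buf, count, dest, tag, req)
      use mpi
      implicit none
      real(8), intent(in)  :: buf(*)     ! contiguous data to transfer
      integer, intent(in)  :: count, dest, tag
      integer, intent(out) :: req        ! request handle, completed later by MPI_Wait
      integer              :: ierr
      call MPI_Isend(buf, count, MPI_DOUBLE_PRECISION, dest, tag, MPI_COMM_WORLD, req, ierr)
    end subroutine asynch_send_init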

  19.-20. Fortran Semantics vs. MPI Semantics (the same code as slide 18, annotated to show which statements execute on processes P1 and P2).

  21. Fortran Semantics vs. MPI Semantics

  After ASPhALT, the transfer calls operate directly on array slices (as on slide 18). After the FORTRAN compiler, each array slice passed to a call becomes an implicit copy into a contiguous temporary:

    TEMP1[ : ] = rArray[ S:E, T:T+K-1 ]
    asynchRecvInit( TEMP1[ : ] )
    rArray[ S:E, T:T+K-1 ] = TEMP1[ : ]

    TEMP2[ : ] = sArray[ S:E, T:T+K-1 ]
    asynchSend( TEMP2[ : ] )
    sArray[ S:E, T:T+K-1 ] = TEMP2[ : ]

  An array slice implies an implicit copy so that the data is contiguous.

  22. Fortran Semantics vs. MPI Semantics: Potential Problems. In the compiler-generated code:

    TEMP1[ : ] = rArray[ S:E, T:T+K-1 ]
    asynchRecvInit( TEMP1[ : ] )
    rArray[ S:E, T:T+K-1 ] = TEMP1[ : ]

  TEMP1 is copied back into rArray, but the data has not arrived yet.

    TEMP2[ : ] = sArray[ S:E, T:T+K-1 ]
    asynchSend( TEMP2[ : ] )
    sArray[ S:E, T:T+K-1 ] = TEMP2[ : ]

  Data-flow analysis allows the F90 compiler to re-define TEMP2 after this copy, but the data has not departed yet.
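  One standard MPI way to avoid compiler-generated temporaries like TEMP1 and TEMP2 altogether, noted here as general background rather than as what ASPhALT implements, is to describe the strided array section with an MPI derived datatype, so the original array is handed to the non-blocking call in place and there is no copy-in/copy-out to race against the transfer. A minimal sketch, with the routine name and interface assumed for illustration:

    ! Sketch only: the routine name and interface are assumptions, not ASPhALT's API.
    subroutine isend_section(sArray, NX, S, E, T, K, dest, tag, req)
      use mpi
      implicit none
      integer, intent(in)  :: NX, S, E, T, K, dest, tag
      real(8), intent(in)  :: sArray(NX, *)
      integer, intent(out) :: req
      integer              :: sectype, ierr
      ! Describe sArray(S:E, T:T+K-1): K blocks of (E-S+1) elements, NX elements apart,
      ! so MPI reads the strided data in place and no temporary copy is generated.
      call MPI_Type_vector(K, E - S + 1, NX, MPI_DOUBLE_PRECISION, sectype, ierr)
      call MPI_Type_commit(sectype, ierr)
      call MPI_Isend(sArray(S, T), 1, sectype, dest, tag, MPI_COMM_WORLD, req, ierr)
      call MPI_Type_free(sectype, ierr)  ! safe once the send is started; it still completes normally
    end subroutine isend_section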
