
Finding and Optimizing Phases in Parallel Programs Ray Chen - PowerPoint PPT Presentation



  1. Finding and Optimizing Phases in Parallel Programs Ray Chen <rchen@cs.umd.edu> Jeffrey K. Hollingsworth <hollings@cs.umd.edu> Scalable Tools Workshop 2016

  2. Motivation
     • HPC programs often contain “phases”
       – Dynamic execution context (like a stack trace for performance)
       – Each has distinct performance traits
     • Particularly disruptive if inside a timestep loop
       – Short phases confound tools
       – Difficult to analyze a rapidly changing landscape
       – Worse if phases are nested
     8/2/16 Finding and Optimizing Phases in Parallel Programs: Scalable Tools Workshop 2

  3. LULESH2 MPI Call Trace

         while (locDom->time() < locDom->stoptime()) {
             TimeIncrement(*locDom);
             LagrangeLeapFrog(*locDom);
         }

  4. Automatic Phase Identification
     • Prior art (chosen completely at random)
       – IPS-2
       – Paradyn’s Performance Consultant
     • Key: Automatic identification is hard
       – Rely on experts for annotations

  5. Guided Phase Identification

     Original loop:

         while (locDom->time() < locDom->stoptime()) {
             TimeIncrement(*locDom);
             LagrangeLeapFrog(*locDom);
         }

     Annotated loop:

         while (locDom->time() < locDom->stoptime()) {
             // Mark the communication-dominated phase.
             cali::Annotation region1("tuner.communication");
             region1.begin();
             TimeIncrement(*locDom);
             region1.end();

             // Mark the computation-dominated phase.
             cali::Annotation region2("tuner.computation");
             region2.begin();
             LagrangeLeapFrog(*locDom);
             region2.end();
         }
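To illustrate why begin/end annotations are enough to recover phases, here is a minimal sketch of a per-phase profile built from scoped regions. It is not Caliper: `PhaseProfile`, `ScopedPhase`, and `run_timesteps` are hypothetical names invented for this example, and the loop body is an empty stand-in for `TimeIncrement` and `LagrangeLeapFrog`.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for an annotation backend; this is NOT the Caliper API.
class PhaseProfile {
public:
    using Clock = std::chrono::steady_clock;

    // Accumulate elapsed time and a visit count under the phase name.
    void record(const std::string& phase, Clock::duration elapsed) {
        Entry& e = entries_[phase];
        e.total += elapsed;
        e.visits += 1;
    }

    // Number of times the named phase was entered (0 if never seen).
    long visits(const std::string& phase) const {
        auto it = entries_.find(phase);
        return it == entries_.end() ? 0 : it->second.visits;
    }

    // Total time spent in the named phase, in seconds.
    double seconds(const std::string& phase) const {
        auto it = entries_.find(phase);
        if (it == entries_.end()) return 0.0;
        return std::chrono::duration<double>(it->second.total).count();
    }

private:
    struct Entry {
        Clock::duration total{};
        long visits = 0;
    };
    std::map<std::string, Entry> entries_;
};

// RAII region: construction marks "begin", destruction marks "end".
class ScopedPhase {
public:
    ScopedPhase(PhaseProfile& profile, std::string name)
        : profile_(profile), name_(std::move(name)),
          start_(PhaseProfile::Clock::now()) {}
    ~ScopedPhase() {
        profile_.record(name_, PhaseProfile::Clock::now() - start_);
    }
    ScopedPhase(const ScopedPhase&) = delete;
    ScopedPhase& operator=(const ScopedPhase&) = delete;

private:
    PhaseProfile& profile_;
    std::string name_;
    PhaseProfile::Clock::time_point start_;
};

// Toy timestep loop with two annotated phases per iteration.
PhaseProfile run_timesteps(int steps) {
    PhaseProfile profile;
    for (int i = 0; i < steps; ++i) {
        {
            ScopedPhase phase(profile, "tuner.communication");
            // Stand-in for TimeIncrement(*locDom).
        }
        {
            ScopedPhase phase(profile, "tuner.computation");
            // Stand-in for LagrangeLeapFrog(*locDom).
        }
    }
    return profile;
}
```

Because measurements are keyed by phase name rather than by wall-clock position, every iteration of the timestep loop folds into the same two entries, however short each individual phase is.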

  6. Performance Landscape
     [Figure: timeline panels labeled “Actual Timeline”, “Contextual Timeline”, and “Per Iteration”, with size labels 2.5KB and 3,700KB]

  7. Cross-Domain Analysis
     • Utilize experts during development
       – Library writers specify tuning variables
       – Application writers specify code regions
       – Phase dictates different performance context
         • Even though the same function is being called
     [Figure callouts: “I know what variables affect FFTW performance”; “I know what variables affect MPI performance”; “I know what variables affect BLAS performance”; “My application has three phases”]

  8. Integration Work
     • Special annotation types identify:
       – Tunable variables
       – Code regions that should enable tuning
     • New Caliper tuning service
       – Listens for and reacts to special annotations
       – Calls Active Harmony to perform search
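The listen-and-react pattern described above can be sketched as a tunable variable that hands out a trial value when a region begins and receives a cost when it ends. This is an assumption-laden illustration: `TunableInt` and `tune` are hypothetical names, no Caliper or Active Harmony calls are used, and a plain exhaustive sweep stands in for whatever search strategy Active Harmony would actually apply.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical tunable variable; not part of Caliper or Active Harmony.
class TunableInt {
public:
    // The candidate list is assumed to be non-empty.
    explicit TunableInt(std::vector<int> candidates)
        : candidates_(std::move(candidates)) {}

    // Called when a tuning-enabled region begins: the value to use this time.
    int next_value() const {
        return converged() ? candidates_[best_] : candidates_[trial_];
    }

    // Called when the region ends, with the cost measured for next_value().
    void report(double cost) {
        if (converged()) return;  // search finished; keep the best value
        if (cost < best_cost_) {
            best_cost_ = cost;
            best_ = trial_;
        }
        ++trial_;
    }

    // True once every candidate has been tried.
    bool converged() const { return trial_ >= candidates_.size(); }

private:
    std::vector<int> candidates_;
    std::size_t trial_ = 0;
    std::size_t best_ = 0;
    double best_cost_ = std::numeric_limits<double>::infinity();
};

// Drive the search for a fixed number of region executions and return the
// value the variable settles on. cost_of is a stand-in for timing the region.
int tune(TunableInt& var, double (*cost_of)(int), int iterations) {
    for (int i = 0; i < iterations; ++i) {
        int value = var.next_value();  // region begin
        var.report(cost_of(value));    // region end
    }
    return var.next_value();
}
```

The application code only sees `next_value()` at region entry, so the search strategy behind `report()` can be swapped without touching the annotated regions.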

  9. 3D Fast Fourier Transform
     • FFT in 3 dimensions
       – Composed of three 1-dimensional FFTs
       – Data is redistributed among processes between FFTs
     [Figure: stages labeled FFTz, A2A1 (blocking), FFTy, A2A2 (blocking), FFTx]

  10. Computation/Communication Overlap
     [Figure: two pipelines. First: FFTz, A2A1 (blocking), FFTy, A2A2 (blocking), FFTx. Second: FFTz, A2A1 (non-blocking), FFTy1, FFTy2, A2A2 (non-blocking), FFTx]

  11. Auto-tuning Opportunities
     [Figure: overlapped pipeline with stages FFTz, A2A1 (non-blocking), FFTy1, FFTy2, A2A2 (non-blocking), FFTx; tunable parameters labeled T1, T2, Ux1, Px1, Py1, Uz1; data blocks of size Nz / p2 and Ny / p2; panels “Unpack & FFTy1” and “FFTz & Pack”]

  12. Nested Phases
     • Block size during A2A transfer is tunable
       – Relatively independent from other variables
       – May be tuned as a nested sub-phase
     • Outer and inner phases run in tandem
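A minimal sketch of outer and inner phases tuning in tandem, assuming the two variables are independent enough to be searched separately. Everything here is hypothetical: `Tuner`, `run_nested`, the candidate values, and the cost functions `outer_cost` and `inner_cost` are made up for illustration, and the outer phase is assumed to report only its exclusive cost (time spent outside the inner sub-phase).

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical one-variable exhaustive tuner (not Active Harmony).
struct Tuner {
    // The candidate list is assumed to be non-empty.
    explicit Tuner(std::vector<int> c) : candidates(std::move(c)) {}

    bool converged() const { return trial >= candidates.size(); }
    int value() const { return candidates[converged() ? best : trial]; }
    void report(double cost) {
        if (converged()) return;
        if (cost < best_cost) { best_cost = cost; best = trial; }
        ++trial;
    }

    std::vector<int> candidates;
    std::size_t trial = 0;
    std::size_t best = 0;
    double best_cost = std::numeric_limits<double>::infinity();
};

struct Result {
    int outer_value;
    int inner_value;
    int outer_steps;  // outer-phase executions needed for both to converge
};

// Made-up cost models: the outer variable is best at 8, the inner at 256.
double outer_cost(int v) { double d = v - 8; return d * d; }
double inner_cost(int block) { double d = block - 256; return d * d; }

// Each outer phase contains several executions of the inner sub-phase.
// The inner tuner advances once per inner execution, the outer tuner once
// per outer execution, so both searches progress during the same run.
Result run_nested(Tuner& outer, Tuner& inner, int inner_per_outer) {
    int steps = 0;
    while (!outer.converged() || !inner.converged()) {
        int o = outer.value();                 // outer phase begins
        for (int k = 0; k < inner_per_outer; ++k) {
            int b = inner.value();             // inner sub-phase begins
            inner.report(inner_cost(b));       // inner sub-phase ends
        }
        outer.report(outer_cost(o));           // outer phase ends (exclusive cost)
        ++steps;
    }
    return {outer.value(), inner.value(), steps};
}
```

With 4 outer candidates and 5 inner candidates, a joint sweep would cover 20 combinations, whereas the nested arrangement here finishes once the slower of the two independent sweeps is done.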

  13. Online Auto-Tuning

  14. Offline Auto-Tuning Cost

  15. Online vs. Offline Tuning
     • Improvements over offline tuning
       – Nested phases reduce search complexity
       – Reduce search dimensions from 24 to 16
       – 40% fewer search steps needed to converge
       – Equivalent performance after convergence
     • Eliminates need for training runs
       – Don’t allocate thousands of nodes to train

  16. Conclusion
     • Phases are key for HPC analysis tools
       – Rely on human guidance through annotations
     • Annotations unite cross-domain expertise
       – Libraries annotate variables to analyze
       – Applications annotate regions to analyze
     • Currently analyzing other HPC codes
       – HPGMG has natural phases to exploit
       – AMR codes are next in line
