Finding and Optimizing Phases in Parallel Programs


Finding and Optimizing Phases in Parallel Programs Jeffrey K. Hollingsworth <hollings@cs.umd.edu> Ray Chen <rchen@cs.umd.edu> Phases of UMD CS Computer & Space Sciences: 1962-1987 AV Williams: 1987-2018 Iribe Center: 2018-


  1. Finding and Optimizing Phases in Parallel Programs Jeffrey K. Hollingsworth <hollings@cs.umd.edu> Ray Chen <rchen@cs.umd.edu>

  2. Phases of UMD CS • Computer & Space Sciences: 1962-1987 • AV Williams: 1987-2018 • Iribe Center: 2018- • CS@UMD: Future is Exciting • Largest Major on campus (over 2,800 undergrads, plus 400+ Computer Engineering) • New Building in 2018 • Hiring O(10) New Faculty in couple of years • New Big Data Masters & Certificate Programs

  3. Motivation • HPC programs often contain “phases” – Dynamic execution context – Each has distinct performance traits • Particularly problematic if inside a time-step loop – Short phases confound tools – Difficult to analyze a rapidly changing landscape – Worse if phases are nested

  4. LULESH2 MPI Call Trace
     while (locDom->time() < locDom->stoptime()) {
       TimeIncrement(*locDom);
       LagrangeLeapFrog(*locDom);
     }

  5. Automatic Phase Identification • My Failed Prior Attempts – IPS-2 (c. 1990) – Paradyn’s Performance Consultant (c. 1995) • Solution – Automatic identification is hard, rely on experts for annotations – Create virtual phases by stitching short ones together
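The "virtual phase" idea above can be illustrated with a minimal sketch. It assumes each short phase instance has already been recorded as a labeled begin/end interval; the `PhaseInterval`, `VirtualPhase`, and `stitch_phases` names are hypothetical and are not taken from the talk or from any tool it mentions. Instances that share a label are folded into one aggregate, so a tool sees one long phase instead of thousands of short ones.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One short phase instance, as recorded from a begin/end annotation pair.
// (Hypothetical record layout, used only for this illustration.)
struct PhaseInterval {
    std::string name;  // annotation label
    double begin_s;    // start time in seconds
    double end_s;      // end time in seconds
};

// Aggregate view of every instance that shares a label.
struct VirtualPhase {
    int instances = 0;     // how many short instances were stitched together
    double total_s = 0.0;  // summed duration of those instances
};

// Stitch short intervals with the same label into one virtual phase each.
std::map<std::string, VirtualPhase>
stitch_phases(const std::vector<PhaseInterval>& trace) {
    std::map<std::string, VirtualPhase> phases;
    for (const PhaseInterval& iv : trace) {
        assert(iv.end_s >= iv.begin_s);  // intervals must be well formed
        VirtualPhase& vp = phases[iv.name];
        vp.instances += 1;
        vp.total_s += iv.end_s - iv.begin_s;
    }
    return phases;
}
```

Keying on the label alone is the simplest choice; a nested-phase variant would presumably key on the full stack of active labels instead.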

  6. Guided Phase Identification
     Original loop:
       while (locDom->time() < locDom->stoptime()) {
         TimeIncrement(*locDom);
         LagrangeLeapFrog(*locDom);
       }
     Annotated loop:
       while (locDom->time() < locDom->stoptime()) {
         cali::Annotation region1("tuner.communication");
         region1.begin();
         TimeIncrement(*locDom);
         region1.end();
         cali::Annotation region2("tuner.computation");
         region2.begin();
         LagrangeLeapFrog(*locDom);
         region2.end();
       }

  7. Performance Landscape [Figure: "Actual Timeline" compared with "Contextual Per Iteration Timeline"; data sizes shown: 3,700KB and 2.5KB]

  8. Cross-Domain Analysis • Utilize experts during development – Library writers specify tuning variables – Application writers specify code regions – Phase dictates different performance context • Even though the same function is being called [Callouts: "I know what variables affect FFTW performance"; "I know what variables affect MPI performance"; "I know what variables affect BLAS performance"; "My application has three phases"]

  9. Integration Work • Special annotation types identify: – Tunable variables – Code regions that should enable tuning • New Caliper tuning service – Listens for and reacts to special annotations – Calls Active Harmony to perform search
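The control flow described on this slide can be sketched roughly as follows. This is not the Caliper service interface or the Active Harmony client API: `PhaseSearch`, `TuningService`, `register_region`, `on_region_begin`, and `on_region_end` are hypothetical names, and the exhaustive sweep over candidate values is a deliberately simple stand-in for whatever search strategy the real tuner uses. The point is only that each annotated region owns an independent search, driven by begin/end events.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-phase search state: tries each candidate once, then
// keeps returning the fastest value seen.
class PhaseSearch {
public:
    explicit PhaseSearch(std::vector<int> candidates)
        : candidates_(std::move(candidates)),
          best_value_(candidates_.empty() ? 0 : candidates_.front()) {
        assert(!candidates_.empty());
    }

    // Value to apply for the next execution of this phase.
    int next() const {
        return tried_ < candidates_.size() ? candidates_[tried_] : best_value_;
    }

    // Report how long the phase took with the value returned by next().
    void report(double seconds) {
        if (tried_ >= candidates_.size()) return;  // search already finished
        if (seconds < best_time_) {
            best_time_ = seconds;
            best_value_ = candidates_[tried_];
        }
        ++tried_;
    }

    bool converged() const { return tried_ >= candidates_.size(); }

private:
    std::vector<int> candidates_;
    std::size_t tried_ = 0;
    int best_value_;
    double best_time_ = std::numeric_limits<double>::infinity();
};

// Hypothetical tuning service: one independent search per annotated region.
class TuningService {
public:
    void register_region(const std::string& region, std::vector<int> candidates) {
        searches_.emplace(region, PhaseSearch(std::move(candidates)));
    }
    // Called when a tuning region begins; returns the value to apply.
    int on_region_begin(const std::string& region) {
        return searches_.at(region).next();
    }
    // Called when the region ends, with its measured duration in seconds.
    void on_region_end(const std::string& region, double seconds) {
        searches_.at(region).report(seconds);
    }
    bool converged(const std::string& region) const {
        return searches_.at(region).converged();
    }

private:
    std::map<std::string, PhaseSearch> searches_;
};
```

Because the searches are keyed by region name, the same tunable variable can settle on different values in different phases, which is the behavior the phase-aware approach relies on.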

  10. 3D Fast Fourier Transform • FFT in 3 dimensions – Composed of three one-dimensional FFTs – Data is redistributed among processes between FFTs [Figure: FFTz, A2A1 (blocking), FFTy, A2A2 (blocking), FFTx]
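The decomposition into three one-dimensional transforms can be checked on a single process with a small sketch. It uses a naive O(n^2) DFT in place of a real FFT library, assumes an x-fastest memory layout, and omits the all-to-all redistribution that a distributed version needs between sweeps; `dft_1d` and `dft_3d` are illustrative names only.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(n^2) 1D DFT over n elements spaced `stride` apart, written back
// in place. A real code would call an FFT library here.
void dft_1d(std::vector<cplx>& data, std::size_t start,
            std::size_t stride, std::size_t n) {
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            double angle = -2.0 * pi * static_cast<double>(k * j) /
                           static_cast<double>(n);
            sum += data[start + j * stride] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    for (std::size_t k = 0; k < n; ++k) data[start + k * stride] = out[k];
}

// 3D DFT of an nx*ny*nz array stored as index = x + nx*(y + ny*z),
// computed as three sweeps of 1D DFTs: along z, then y, then x.
void dft_3d(std::vector<cplx>& data, std::size_t nx, std::size_t ny,
            std::size_t nz) {
    assert(data.size() == nx * ny * nz);
    for (std::size_t y = 0; y < ny; ++y)      // sweep 1: lines along z
        for (std::size_t x = 0; x < nx; ++x)
            dft_1d(data, x + nx * y, nx * ny, nz);
    for (std::size_t z = 0; z < nz; ++z)      // sweep 2: lines along y
        for (std::size_t x = 0; x < nx; ++x)
            dft_1d(data, x + nx * ny * z, nx, ny);
    for (std::size_t z = 0; z < nz; ++z)      // sweep 3: lines along x
        for (std::size_t y = 0; y < ny; ++y)
            dft_1d(data, nx * (y + ny * z), 1, nx);
}
```

In the distributed setting each sweep needs its lines to be local to one process, which is why data has to be redistributed between sweeps.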

  11. Computation/Communication Overlap [Figure: blocking pipeline (FFTz, A2A1 (blocking), FFTy, A2A2 (blocking), FFTx) compared with an overlapped pipeline in which FFTy is split into FFTy1 and FFTy2 and both A2A1 and A2A2 are non-blocking]

  12. Auto-tuning Opportunities [Figure: overlapped pipeline (FFTz, A2A1 (non-blocking), FFTy1, FFTy2, A2A2 (non-blocking), FFTx) with parameter labels T1, T2, Px1, Ux1, Py1, Uz1; extents Nz / p2 and Ny / p2; stages "FFTz & Pack" and "Unpack & FFTy1"]

  13. Online Auto-Tuning

  14. Phase Aware Tuning • Improvements over offline (non-phase) tuning – Reduce search dimensions from 24 to 16 – 40% fewer search steps needed to converge – Equivalent performance after convergence • Eliminates need for training runs – Don’t allocate thousands of nodes to train

  15. Offline Auto-Tuning Cost

  16. Conclusion • Phases are key for HPC analysis tools – Rely on human guidance through annotations – Virtualizing repeated phases helps many types of tools • Annotations unite cross-domain expertise – Libraries annotate variables to analyze – Applications annotate regions to analyze • Currently analyzing other HPC codes
