Optimizing and Tuning the Fast Multipole Method for Multicore and Accelerator Systems

  1. Optimizing and Tuning the Fast Multipole Method for Multicore and Accelerator Systems. Georgia Tech – Aparna Chandramowlishwaran, Aashay Shringarpure, Ilya Lashuk, George Biros, Richard Vuduc. Lawrence Berkeley National Laboratory – Sam Williams, Lenny Oliker. IPDPS 2010, Tuesday, April 20, 2010

  2. Key Ideas and Findings. First cross-platform, single-node multicore study of tuning the fast multipole method (FMM). Explores data structures, SIMD, multithreading, mixed precision, and tuning. Shows 25x speedup on Intel Nehalem, 9.4x on AMD Barcelona, and 37.6x on Sun Victoria Falls. Surprise? Multicore ≈ GPU in performance and energy efficiency for the FMM. Broader context: generalized n-body problems, for particle simulation and statistical data analytics.

  3. High-Performance Multicore FMMs: Analysis, Optimization, and Tuning. Outline: algorithmic characteristics; architectural implications; observations. A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, R. Vuduc – IPDPS 2010

  5. Computing Direct vs. Tree-Based Interactions. Direct evaluation: O(N²). Barnes-Hut: O(N log N). Fast multipole method (FMM): O(N).
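
For concreteness, here is a minimal sketch of the O(N²) direct evaluation that the tree-based methods replace, assuming a 1/r stand-in kernel; the function and array names are illustrative, not taken from the kifmm3d code:

```c
#include <stddef.h>
#include <math.h>

/* Direct evaluation: every target interacts with every source, O(N^2).
 * K(r) = 1/r is used here as a stand-in kernel; src_den are source densities. */
void direct_eval(size_t n, const double *x, const double *y, const double *z,
                 const double *src_den, double *pot)
{
    for (size_t i = 0; i < n; i++) {          /* targets */
        double p = 0.0;
        for (size_t j = 0; j < n; j++) {      /* sources */
            if (j == i) continue;             /* skip self-interaction */
            double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
            double r2 = dx*dx + dy*dy + dz*dz;
            p += src_den[j] / sqrt(r2);       /* expensive sqrt + divide per pair */
        }
        pot[i] = p;
    }
}
```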

  6. Fast Multipole Method. Given: N target points and N sources; a tree type and max points per leaf, q; a desired accuracy, ε. Two steps: (1) build the tree; (2) evaluate the potential at all N targets. We use the kernel-independent FMM (KIFMM) of Ying, Zorin, and Biros (2004).

  7. Tree Construction. Recursively divide space until each box has at most q points. [Figure: adaptive quadtree; boxes around a target box B are labeled U, V, W, or X.]
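
A minimal sketch of this adaptive construction (a 2D quadtree for brevity; KIFMM uses an octree in 3D). The types and names are illustrative, and a real implementation would also cap the tree depth to guard against coincident points:

```c
#include <stdlib.h>

/* Adaptive tree construction: split a box until it holds at most q points. */
typedef struct Box {
    double cx, cy, half;        /* center and half-width */
    int *idx, n;                /* indices of points owned by this box */
    struct Box *child[4];       /* all NULL for a leaf */
} Box;

void build_tree(Box *b, const double *x, const double *y, int q)
{
    if (b->n <= q) return;                       /* leaf: at most q points */
    for (int c = 0; c < 4; c++) {
        Box *ch = calloc(1, sizeof(Box));        /* zeroed: ch->n = 0, no children */
        ch->half = b->half / 2;
        ch->cx = b->cx + ((c & 1) ? ch->half : -ch->half);
        ch->cy = b->cy + ((c & 2) ? ch->half : -ch->half);
        ch->idx = malloc(b->n * sizeof(int));
        for (int i = 0; i < b->n; i++) {         /* route each point to its quadrant */
            int p = b->idx[i];
            int cc = (x[p] > b->cx) | ((y[p] > b->cy) << 1);
            if (cc == c) ch->idx[ch->n++] = p;
        }
        b->child[c] = ch;
        build_tree(ch, x, y, q);                 /* recurse only where points cluster */
    }
}
```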

  8. Evaluation Phase. Given the adaptive tree, FMM evaluation performs a series of tree traversals, doing some work at each node B. Six phases: (1) upward pass; (2–5) list computations; (6) downward pass. The phases vary in data parallelism and in arithmetic intensity (flops : mops). [Figure: adaptive tree; boxes around B labeled U, V, W, X.]

  10. U-List. Direct evaluation B ⊗ U: O(q²) flops : O(q) mops. U_L(B: leaf) := neighbors(B); U_L(B: non-leaf) := ∅. [Figure: boxes adjacent to leaf B form its U-list.]
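
A minimal sketch of one U-list interaction between a target leaf and one neighbor, reusing the 1/r stand-in kernel from above (names illustrative): with q points per box it performs O(q²) flops on O(q) loaded data, which is what makes this phase compute-bound.

```c
#include <math.h>

/* Direct near-field interaction of target box T against one source box S.
 * q points per box -> O(q^2) flops over O(q) data (high arithmetic intensity). */
void ulist_direct(int qt, const double *tx, const double *ty, const double *tz,
                  int qs, const double *sx, const double *sy, const double *sz,
                  const double *sden, double *tpot)
{
    for (int i = 0; i < qt; i++) {
        double p = tpot[i];
        for (int j = 0; j < qs; j++) {
            double dx = tx[i] - sx[j], dy = ty[i] - sy[j], dz = tz[i] - sz[j];
            double r2 = dx*dx + dy*dy + dz*dz;
            if (r2 > 0.0) p += sden[j] / sqrt(r2);   /* skip coincident points */
        }
        tpot[i] = p;
    }
}
```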

  11. V-List. V_L(B) := child(neigh(par(B))) − adj(B). In 3D, FFTs + pointwise multiplication: easily vectorized, but low intensity vs. the U-list. [Figure: boxes labeled V surround B.]
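
Once the translations are in Fourier space, the V-list inner step is a pointwise complex multiply-accumulate; a minimal sketch with the FFT setup omitted and illustrative names:

```c
#include <complex.h>

/* V-list inner kernel: pointwise multiply a box's Fourier-space source
 * coefficients by a translation operator and accumulate into the target.
 * Roughly one complex multiply-add per element loaded -> low arithmetic
 * intensity, but a trivially vectorizable loop. */
void vlist_pointwise(int n, const double complex *src,
                     const double complex *trans, double complex *acc)
{
    for (int k = 0; k < n; k++)
        acc[k] += src[k] * trans[k];
}
```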

  12. W-List. Moderate intensity. W_L(B: leaf) := desc[par(neigh(B)) ∩ adj(B)] − adj(B); W_L(B: non-leaf) := ∅. [Figure: boxes labeled W around B.]

  13. X-List. Moderate intensity. X_L(B) := {A : B ∈ W_L(A)}. [Figure: box labeled X relative to B.]

  14. Essence of the Computation. Parallelism exists (1) among phases, with some dependencies; (2) within each phase; and (3) per box. We do not currently exploit (1); a sketch of (2)/(3) follows. [Figure: adaptive tree around box B.]
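
A minimal sketch of within-phase, per-box parallelism as an OpenMP loop (the box list and per-box worker are hypothetical stand-ins for the real phase code):

```c
#include <omp.h>

/* Within-phase parallelism: the boxes of one list phase are independent,
 * so a parallel loop over boxes suffices. Dynamic scheduling helps because
 * adaptive trees give boxes very uneven amounts of work. */
void run_phase(int nboxes, void **boxes, void (*box_work)(void *))
{
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nboxes; b++)
        box_work(boxes[b]);
}
```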

  15. Essence of the Computation. Large q implies a large U-list cost, O(q²), but cheaper V-, W-, and X-list costs (a shallower tree). The algorithmic tuning parameter q thus has a global impact on cost. [Figure: adaptive tree around box B.]

  16. Essence of the Computation. KIFMM (our variant) requires kernel evaluations K(r) = C/√r with expensive flops. For instance, square root and divide are expensive, and sometimes not pipelined.
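
The cost concentrates in a tiny scalar routine like the following sketch (illustrative, not the kifmm3d code), whose sqrt and divide motivate the reciprocal-square-root optimization on the next slides:

```c
#include <math.h>

/* Scalar kernel evaluation, K = C / sqrt(r2) with r2 the squared distance:
 * one sqrt and one divide per interaction, both long-latency operations
 * that are often not pipelined. */
static inline double kernel_eval(double C, double r2)
{
    return C / sqrt(r2);
}
```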

  17. High-Performance Multicore FMMs: Analysis, Optimization, and Tuning. Outline: algorithmic characteristics; architectural implications; observations. A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, R. Vuduc – IPDPS 2010

  18. Hardware Thread and Core Configurations. Intel X5550 "Nehalem": 2 sockets × 4 cores/socket × 2 threads/core → 16 threads; fast 2.66 GHz cores, out-of-order, deep pipelines. AMD Opteron 2356 "Barcelona": 2 × 4 × 1 thread/core → 8 threads; fast 2.3 GHz cores, out-of-order, deep pipelines. Sun T5140 "Victoria Falls": 2 × 8 × 8 threads/core → 128 threads; 1.166 GHz cores, in-order, shallow pipelines. How do they differ? What are the implications for the FMM?

  19. High-Performance Multicore FMMs: Analysis, Optimization, and Tuning. Outline: algorithmic characteristics; architectural implications; observations.

  20. Optimizations. Single-core, manually coded and tuned. Low-level: SIMD vectorization (x86). Numerical: rsqrtps + Newton-Raphson (x86), sketched below. Data: structure reorganization (transpose, or "SOA"). Traffic: matrix-free evaluation via interprocedural loop fusion; FFTW plan optimization. Plus OpenMP parallelization and algorithmic tuning of the max particles per box, q.
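
A minimal sketch of the rsqrtps + Newton-Raphson trick named above, assuming SSE intrinsics: the x86 rsqrtps instruction gives an approximation (about 12 bits) to 1/√x, and one Newton-Raphson step, y ← 0.5·y·(3 − x·y²), refines it; the double-precision variant described in the paper adds further refinement.

```c
#include <immintrin.h>

/* Approximate 1/sqrt(x) for four packed floats: rsqrtps (~12-bit accurate)
 * followed by one Newton-Raphson refinement step. This replaces a slow
 * sqrt + divide with a handful of cheap multiplies and a subtract. */
static inline __m128 rsqrt_nr_ps(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y   = _mm_rsqrt_ps(x);                       /* coarse estimate */
    __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y));       /* x * y * y */
    return _mm_mul_ps(_mm_mul_ps(half, y),
                      _mm_sub_ps(three, xyy));          /* 0.5*y*(3 - x*y*y) */
}
```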

  21. Single-Core Optimizations. N_s = N_t = 4M, double precision, non-uniform (ellipsoidal). Reference: kifmm3d [Ying, Langston, Zorin, Biros]. [Bar chart, Nehalem: per-phase speedup (Tree, Up, U list, V list, W list, X list, Down), −100% to 600%, as optimizations are added; legend: +SIMDization, +Newton-Raphson, +Structure of Arrays, +Matrix-Free, +FFTW, Approximation, Computation.]

  22. Single-Core Optimizations. N_s = N_t = 4M, double precision, non-uniform (ellipsoidal). SIMD → 85.5 (double), 170.6 (single) Gflop/s. Reciprocal square root → 0.853 (double), 42.66 (single) Gflop/s. x86 has a fast approximate single-precision rsqrt, exploitable in double precision. [Bar chart, Nehalem: per-phase speedups, same legend as the previous slide.]

  23. Single-Core Optimizations. N_s = N_t = 4M, double precision, non-uniform (ellipsoidal). Overall: ~4.5x on Nehalem, ~2.2x on Barcelona, ~1.4x on Victoria Falls. Less impact on Barcelona (why?) and Victoria Falls. [Bar charts: per-phase speedup for Nehalem (to 600%), Barcelona (to 300%), and Victoria Falls (to 55%), same optimization legend as before.]

  24. Algorithmic Tuning of q = Max Points per Box. Nehalem, force evaluation only. Tree shape and relative component costs vary as q varies. [Plot: seconds (0–600) vs. maximum particles per box (50, 100, 250, 500, 750) for the reference serial code; marked value: 168 s.]

  25. Algorithmic Tuning of q = Max Points per Box. Nehalem, force evaluation only. The shape of the curve changes as we introduce optimizations. [Plot: as above, adding the optimized serial curve.]

  26. Algorithmic Tuning of q = Max Points per Box. Nehalem, force evaluation only. The shape of the curve changes as we introduce optimizations. [Plot: as above, adding the optimized parallel curve; marked value: 10.4 s.]

  27. Algorithmic Tuning of q = Max Points per Box. Why? Consider phase costs for the "Optimized Parallel" implementation. [Left plot: seconds vs. q for reference serial (168 s), optimized serial, and optimized parallel (10.4 s). Right plot: breakdown by list; U-list seconds (0–14) vs. q.]

  28. Algorithmic Tuning of q = Max Points per Box. Recall: Cost(U-list) ~ O(q²) per box. [Plots: as on the previous slide.]

  29. Algorithmic Tuning of q = Max Points per Box. A shallower tree reduces the cost of the V-list phase. [Plot, Nehalem: breakdown by list; U-list and V-list seconds vs. maximum particles per box (50–750). Figure: adaptive tree around box B.]
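
A minimal sketch of the empirical tuning these slides imply: sweep q, rebuild the tree, time the evaluation, and keep the best setting. The fmm_* entry points are hypothetical stand-ins for the real KIFMM driver:

```c
#include <stdio.h>
#include <omp.h>

/* Hypothetical driver entry points, standing in for the real KIFMM calls. */
extern void fmm_build_tree(int q);
extern void fmm_evaluate(void);

/* Empirical tuning of q: larger q gives a shallower tree, so U-list work
 * grows as O(q^2) per box while V/W/X-list work shrinks. Total time is
 * therefore U-shaped in q, and the minimum shifts as optimizations change
 * the per-phase costs. */
int tune_q(void)
{
    const int candidates[] = {50, 100, 250, 500, 750};
    int best_q = candidates[0];
    double best_t = 1e300;
    for (int i = 0; i < 5; i++) {
        fmm_build_tree(candidates[i]);
        double t0 = omp_get_wtime();
        fmm_evaluate();
        double t = omp_get_wtime() - t0;
        printf("q = %3d: %.2f s\n", candidates[i], t);
        if (t < best_t) { best_t = t; best_q = candidates[i]; }
    }
    return best_q;    /* best observed setting of the tuning parameter */
}
```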
