Optimizing and Tuning the Fast Multipole Method for Multicore and Accelerator Systems

  1. Optimizing and Tuning the Fast Multipole Method for Multicore and Accelerator Systems. Georgia Tech – Aparna Chandramowlishwaran, Aashay Shringarpure, Ilya Lashuk, George Biros, Richard Vuduc. Lawrence Berkeley National Laboratory – Sam Williams, Lenny Oliker. IPDPS 2010, Tuesday, April 20, 2010

  2. Key Ideas and Findings. First cross-platform, single-node multicore study of tuning the fast multipole method (FMM). Explores data structures, SIMD, multithreading, mixed precision, and tuning. Shows 25x speedup on Intel Nehalem, 9.4x on AMD Barcelona, and 37.6x on Sun Victoria Falls. Surprise? Multicore ≈ GPU in performance and energy efficiency for the FMM. Broader context: generalized n-body problems, for particle simulation and statistical data analytics.

  3. High-Performance Multicore FMMs: Analysis, Optimization, and Tuning. Outline: algorithmic characteristics; architectural implications; observations. A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, R. Vuduc – IPDPS 2010

  5. Computing Direct vs. Tree-Based Interactions. Direct evaluation: O(N²). Barnes-Hut: O(N log N). Fast multipole method (FMM): O(N).
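
For concreteness, here is a minimal sketch of the O(N²) direct evaluation that the tree-based methods replace, assuming a 1/r stand-in kernel; the function and array names are illustrative, not taken from the kifmm3d code:

```c
#include <stddef.h>
#include <math.h>

/* Direct evaluation: every target interacts with every source, O(N^2).
 * K(r) = 1/r is used here as a stand-in kernel; src_den are source densities. */
void direct_eval(size_t n, const double *x, const double *y, const double *z,
                 const double *src_den, double *pot)
{
    for (size_t i = 0; i < n; i++) {          /* targets */
        double p = 0.0;
        for (size_t j = 0; j < n; j++) {      /* sources */
            if (j == i) continue;             /* skip self-interaction */
            double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
            double r2 = dx*dx + dy*dy + dz*dz;
            p += src_den[j] / sqrt(r2);       /* expensive sqrt + divide per pair */
        }
        pot[i] = p;
    }
}
```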

  6. Fast Multipole Method. Given: N target points and N sources; a tree type and max points per leaf, q; a desired accuracy, ε. Two steps: (1) build the tree; (2) evaluate the potential at all N targets. We use the kernel-independent FMM (KIFMM) of Ying, Zorin, and Biros (2004).

  7. Tree Construction. Recursively divide space until each box has at most q points. [Figure: adaptive quadtree; boxes around a target box B are labeled U, V, W, or X.]
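
A minimal sketch of this adaptive construction (a 2D quadtree for brevity; KIFMM uses an octree in 3D). The types and names are illustrative, and a real implementation would also cap the tree depth to guard against coincident points:

```c
#include <stdlib.h>

/* Adaptive tree construction: split a box until it holds at most q points. */
typedef struct Box {
    double cx, cy, half;        /* center and half-width */
    int *idx, n;                /* indices of points owned by this box */
    struct Box *child[4];       /* all NULL for a leaf */
} Box;

void build_tree(Box *b, const double *x, const double *y, int q)
{
    if (b->n <= q) return;                       /* leaf: at most q points */
    for (int c = 0; c < 4; c++) {
        Box *ch = calloc(1, sizeof(Box));        /* zeroed: ch->n = 0, no children */
        ch->half = b->half / 2;
        ch->cx = b->cx + ((c & 1) ? ch->half : -ch->half);
        ch->cy = b->cy + ((c & 2) ? ch->half : -ch->half);
        ch->idx = malloc(b->n * sizeof(int));
        for (int i = 0; i < b->n; i++) {         /* route each point to its quadrant */
            int p = b->idx[i];
            int cc = (x[p] > b->cx) | ((y[p] > b->cy) << 1);
            if (cc == c) ch->idx[ch->n++] = p;
        }
        b->child[c] = ch;
        build_tree(ch, x, y, q);                 /* recurse only where points cluster */
    }
}
```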

  8. Evaluation Phase. Given the adaptive tree, FMM evaluation performs a series of tree traversals, doing some work at each node B. Six phases: (1) upward pass; (2–5) list computations; (6) downward pass. The phases vary in data parallelism and in arithmetic intensity (flops : mops). [Figure: adaptive tree; boxes around B labeled U, V, W, X.]

  10. U-List. Direct evaluation B ⊗ U: O(q²) flops : O(q) mops. U_L(B: leaf) := neighbors(B); U_L(B: non-leaf) := ∅. [Figure: boxes adjacent to leaf B form its U-list.]
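
A minimal sketch of one U-list interaction between a target leaf and one neighbor, reusing the 1/r stand-in kernel from above (names illustrative): with q points per box it performs O(q²) flops on O(q) loaded data, which is what makes this phase compute-bound.

```c
#include <math.h>

/* Direct near-field interaction of target box T against one source box S.
 * q points per box -> O(q^2) flops over O(q) data (high arithmetic intensity). */
void ulist_direct(int qt, const double *tx, const double *ty, const double *tz,
                  int qs, const double *sx, const double *sy, const double *sz,
                  const double *sden, double *tpot)
{
    for (int i = 0; i < qt; i++) {
        double p = tpot[i];
        for (int j = 0; j < qs; j++) {
            double dx = tx[i] - sx[j], dy = ty[i] - sy[j], dz = tz[i] - sz[j];
            double r2 = dx*dx + dy*dy + dz*dz;
            if (r2 > 0.0) p += sden[j] / sqrt(r2);   /* skip coincident points */
        }
        tpot[i] = p;
    }
}
```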

  11. V-List. V_L(B) := child(neigh(par(B))) − adj(B). In 3D, FFTs + pointwise multiplication: easily vectorized, but low intensity vs. the U-list. [Figure: boxes labeled V surround B.]
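
Once the translations are in Fourier space, the V-list inner step is a pointwise complex multiply-accumulate; a minimal sketch with the FFT setup omitted and illustrative names:

```c
#include <complex.h>

/* V-list inner kernel: pointwise multiply a box's Fourier-space source
 * coefficients by a translation operator and accumulate into the target.
 * Roughly one complex multiply-add per element loaded -> low arithmetic
 * intensity, but a trivially vectorizable loop. */
void vlist_pointwise(int n, const double complex *src,
                     const double complex *trans, double complex *acc)
{
    for (int k = 0; k < n; k++)
        acc[k] += src[k] * trans[k];
}
```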

  12. W-List. Moderate intensity. W_L(B: leaf) := desc[par(neigh(B)) ∩ adj(B)] − adj(B); W_L(B: non-leaf) := ∅. [Figure: boxes labeled W around B.]

  13. X-List. Moderate intensity. X_L(B) := {A : B ∈ W_L(A)}. [Figure: box labeled X relative to B.]

  14. Essence of the Computation. Parallelism exists (1) among phases, with some dependencies; (2) within each phase; and (3) per box. We do not currently exploit (1); a sketch of (2)/(3) follows. [Figure: adaptive tree around box B.]
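
A minimal sketch of within-phase, per-box parallelism as an OpenMP loop (the box list and per-box worker are hypothetical stand-ins for the real phase code):

```c
#include <omp.h>

/* Within-phase parallelism: the boxes of one list phase are independent,
 * so a parallel loop over boxes suffices. Dynamic scheduling helps because
 * adaptive trees give boxes very uneven amounts of work. */
void run_phase(int nboxes, void **boxes, void (*box_work)(void *))
{
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nboxes; b++)
        box_work(boxes[b]);
}
```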

  15. Essence of the Computation. Large q implies a large U-list cost, O(q²), but cheaper V-, W-, and X-list costs (a shallower tree). The algorithmic tuning parameter q thus has a global impact on cost. [Figure: adaptive tree around box B.]

  16. Essence of the Computation. KIFMM (our variant) requires kernel evaluations K(r) = C/√r with expensive flops. For instance, square root and divide are expensive, and sometimes not pipelined.
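
The cost concentrates in a tiny scalar routine like the following sketch (illustrative, not the kifmm3d code), whose sqrt and divide motivate the reciprocal-square-root optimization on the next slides:

```c
#include <math.h>

/* Scalar kernel evaluation, K = C / sqrt(r2) with r2 the squared distance:
 * one sqrt and one divide per interaction, both long-latency operations
 * that are often not pipelined. */
static inline double kernel_eval(double C, double r2)
{
    return C / sqrt(r2);
}
```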

  17. High-Performance Multicore FMMs: Analysis, Optimization, and Tuning. Outline: algorithmic characteristics; architectural implications; observations. A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, R. Vuduc – IPDPS 2010

  18. Hardware Thread and Core Configurations. Intel X5550 "Nehalem": 2 sockets × 4 cores/socket × 2 threads/core → 16 threads; fast 2.66 GHz cores, out-of-order, deep pipelines. AMD Opteron 2356 "Barcelona": 2 × 4 × 1 thread/core → 8 threads; fast 2.3 GHz cores, out-of-order, deep pipelines. Sun T5140 "Victoria Falls": 2 × 8 × 8 threads/core → 128 threads; 1.166 GHz cores, in-order, shallow pipelines. How do they differ? What are the implications for the FMM?

  19. High-Performance Multicore FMMs: Analysis, Optimization, and Tuning. Outline: algorithmic characteristics; architectural implications; observations.

  20. Optimizations. Single-core, manually coded and tuned. Low-level: SIMD vectorization (x86). Numerical: rsqrtps + Newton-Raphson (x86), sketched below. Data: structure reorganization (transpose, or "SOA"). Traffic: matrix-free evaluation via interprocedural loop fusion; FFTW plan optimization. Plus OpenMP parallelization and algorithmic tuning of the max particles per box, q.
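
A minimal sketch of the rsqrtps + Newton-Raphson trick named above, assuming SSE intrinsics: the x86 rsqrtps instruction gives an approximation (about 12 bits) to 1/√x, and one Newton-Raphson step, y ← 0.5·y·(3 − x·y²), refines it; the double-precision variant described in the paper adds further refinement.

```c
#include <immintrin.h>

/* Approximate 1/sqrt(x) for four packed floats: rsqrtps (~12-bit accurate)
 * followed by one Newton-Raphson refinement step. This replaces a slow
 * sqrt + divide with a handful of cheap multiplies and a subtract. */
static inline __m128 rsqrt_nr_ps(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y   = _mm_rsqrt_ps(x);                       /* coarse estimate */
    __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y));       /* x * y * y */
    return _mm_mul_ps(_mm_mul_ps(half, y),
                      _mm_sub_ps(three, xyy));          /* 0.5*y*(3 - x*y*y) */
}
```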

  21. Single-Core Optimizations. N_s = N_t = 4M, double precision, non-uniform (ellipsoidal). Reference: kifmm3d [Ying, Langston, Zorin, Biros]. [Bar chart, Nehalem: per-phase speedup (Tree, Up, U list, V list, W list, X list, Down), −100% to 600%, as optimizations are added; legend: +SIMDization, +Newton-Raphson, +Structure of Arrays, +Matrix-Free, +FFTW, Approximation, Computation.]

  22. Single-Core Optimizations. N_s = N_t = 4M, double precision, non-uniform (ellipsoidal). SIMD → 85.5 (double), 170.6 (single) Gflop/s. Reciprocal square root → 0.853 (double), 42.66 (single) Gflop/s. x86 has a fast approximate single-precision rsqrt, exploitable in double precision. [Bar chart, Nehalem: per-phase speedups, same legend as the previous slide.]

  23. Single-Core Optimizations. N_s = N_t = 4M, double precision, non-uniform (ellipsoidal). Overall: ~4.5x on Nehalem, ~2.2x on Barcelona, ~1.4x on Victoria Falls. Less impact on Barcelona (why?) and Victoria Falls. [Bar charts: per-phase speedup for Nehalem (to 600%), Barcelona (to 300%), and Victoria Falls (to 55%), same optimization legend as before.]

  24. Algorithmic Tuning of q = Max Points per Box. Nehalem, force evaluation only. Tree shape and relative component costs vary as q varies. [Plot: seconds (0–600) vs. maximum particles per box (50, 100, 250, 500, 750) for the reference serial code; marked value: 168 s.]

  25. Algorithmic Tuning of q = Max Points per Box. Nehalem, force evaluation only. The shape of the curve changes as we introduce optimizations. [Plot: as above, adding the optimized serial curve.]

  26. Algorithmic Tuning of q = Max Points per Box. Nehalem, force evaluation only. The shape of the curve changes as we introduce optimizations. [Plot: as above, adding the optimized parallel curve; marked value: 10.4 s.]

  27. Algorithmic Tuning of q = Max Points per Box. Why? Consider phase costs for the "Optimized Parallel" implementation. [Left plot: seconds vs. q for reference serial (168 s), optimized serial, and optimized parallel (10.4 s). Right plot: breakdown by list; U-list seconds (0–14) vs. q.]

  28. Algorithmic Tuning of q = Max Points per Box. Recall: Cost(U-list) ~ O(q²) per box. [Plots: as on the previous slide.]

  29. Algorithmic Tuning of q = Max Points per Box. A shallower tree reduces the cost of the V-list phase. [Plot, Nehalem: breakdown by list; U-list and V-list seconds vs. maximum particles per box (50–750). Figure: adaptive tree around box B.]
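
A minimal sketch of the empirical tuning these slides imply: sweep q, rebuild the tree, time the evaluation, and keep the best setting. The fmm_* entry points are hypothetical stand-ins for the real KIFMM driver:

```c
#include <stdio.h>
#include <omp.h>

/* Hypothetical driver entry points, standing in for the real KIFMM calls. */
extern void fmm_build_tree(int q);
extern void fmm_evaluate(void);

/* Empirical tuning of q: larger q gives a shallower tree, so U-list work
 * grows as O(q^2) per box while V/W/X-list work shrinks. Total time is
 * therefore U-shaped in q, and the minimum shifts as optimizations change
 * the per-phase costs. */
int tune_q(void)
{
    const int candidates[] = {50, 100, 250, 500, 750};
    int best_q = candidates[0];
    double best_t = 1e300;
    for (int i = 0; i < 5; i++) {
        fmm_build_tree(candidates[i]);
        double t0 = omp_get_wtime();
        fmm_evaluate();
        double t = omp_get_wtime() - t0;
        printf("q = %3d: %.2f s\n", candidates[i], t);
        if (t < best_t) { best_t = t; best_q = candidates[i]; }
    }
    return best_q;    /* best observed setting of the tuning parameter */
}
```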
