  1. 3D Particle Methods Code Parallelization Fabrice Schlegel

  2. Introduction Goal: Efficient parallelization and memory optimization of a CFD code used for the direct numerical simulation (DNS) of turbulent combustion. Hardware: 1) Our lab cluster, Pharos, which consists of 60 Intel Xeon Harpertown nodes, each with dual quad-core CPUs (8 processors) running at 2.66 GHz. 2) Shaheen, a 16-rack IBM Blue Gene/P system owned by KAUST. Four 850 MHz processors are integrated on each Blue Gene/P chip, and a standard Blue Gene/P rack houses 4,096 processors. Parallel library: MPI.

  3. Outline
     • Brief overview of Lagrangian vortex methods
     • Comparison of the ring and copy parallelization algorithms – numerical results in terms of speed and parallel efficiency
     • Previous code modifications to account for the new data structure
     • Applications of the ring algorithm to large transverse jet simulations

  4. Vortex Element Methods Vortex simulations use the vorticity ω instead of the velocity u, which has more compact support:
     $$\frac{\partial\boldsymbol\omega}{\partial t}+\mathbf u\cdot\nabla\boldsymbol\omega=\boldsymbol\omega\cdot\nabla\mathbf u+\frac{1}{Re}\nabla^2\boldsymbol\omega+\frac{Gr}{Re^2}\,\nabla\rho\times\mathbf g$$
     The vorticity field is carried by N elements, each described by a discrete node point {χ_i} and weight {W_i}:
     $$\boldsymbol\omega(\mathbf x,t)\approx\sum_{i=1}^{N}\mathbf W_i(t)\,f_\sigma\!\big(\mathbf x-\boldsymbol\chi_i(t)\big)$$
     $$\frac{d\boldsymbol\chi_i}{dt}=\mathbf u(\boldsymbol\chi_i,t),\qquad\frac{d\mathbf W_i}{dt}=\mathbf W_i\cdot\nabla\mathbf u(\boldsymbol\chi_i,t)$$
     Efficient utilization of computational elements.
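To make the element update concrete, here is a minimal C++ sketch of one time step of the two ODEs above. The Particle struct, the forward-Euler integrator, and the toy solid-body-rotation velocity field standing in for the treecode evaluation are all illustrative assumptions, not the code described in these slides.

```cpp
#include <array>
#include <vector>

// Hypothetical particle: node position chi and vector-valued weight W.
struct Particle {
    std::array<double, 3> chi;
    std::array<double, 3> W;
};

// Toy velocity field (solid-body rotation about the z-axis) standing in for
// the treecode summation over all particles; its gradient is constant.
std::array<double, 3> velocity(const std::array<double, 3>& x, double /*t*/) {
    return {-x[1], x[0], 0.0};
}
std::array<std::array<double, 3>, 3> velocity_gradient(const std::array<double, 3>& /*x*/, double /*t*/) {
    std::array<std::array<double, 3>, 3> g{};   // g[i][j] = du_i/dx_j
    g[0][1] = -1.0;
    g[1][0] =  1.0;
    return g;
}

// One forward-Euler step of the particle ODEs on the slide:
//   d(chi_i)/dt = u(chi_i, t)
//   d(W_i)/dt   = (W_i . grad) u(chi_i, t)
void advance(std::vector<Particle>& particles, double t, double dt) {
    for (Particle& p : particles) {
        const std::array<double, 3> u = velocity(p.chi, t);
        const auto gu = velocity_gradient(p.chi, t);

        std::array<double, 3> dW{};              // (W . grad u)_i = W_j * du_i/dx_j
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                dW[i] += p.W[j] * gu[i][j];

        for (int i = 0; i < 3; ++i) {
            p.chi[i] += dt * u[i];
            p.W[i]   += dt * dW[i];
        }
    }
}
```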

  5. Fast summation for the velocity evaluation Serial treecode [Lindsay and Krasny 2001] – Velocity from the Rosenhead-Moore kernel:
     $$\mathbf u(\mathbf x_i)=\sum_{j=1}^{N}K_{RM}(\mathbf x_i,\mathbf y_j)\times\mathbf W_j,\qquad K_{RM}(\mathbf x,\mathbf y)=-\frac{1}{4\pi}\,\frac{\mathbf x-\mathbf y}{\big(|\mathbf x-\mathbf y|^2+\sigma^2\big)^{3/2}}$$
     – Constructs an adaptive oct-tree and uses a Taylor expansion of the kernel in Cartesian coordinates:
     $$\mathbf u(\mathbf x_{target})\approx\sum_{\|k\|\le p}\frac{1}{k!}\,D_{\mathbf y}^{\,k}K_{RM}(\mathbf x_{target},\mathbf y_c)\times\sum_{i=1}^{N}\mathbf W_i\,(\mathbf y_i-\mathbf y_c)^{k}$$
     The Taylor coefficients are computed with a recurrence relation; the cell moments are stored for re-use.
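As a point of reference for what the treecode accelerates, the following sketch evaluates the direct O(N^2) Rosenhead-Moore sum in C++. The function name, the packing of sources as separate position and weight arrays, and the smoothing-radius parameter name sigma are assumptions, and the sign convention of the kernel varies between references.

```cpp
#include <array>
#include <cmath>
#include <vector>

// Direct summation of the regularized Biot-Savart velocity at one target
// point; the treecode of Lindsay & Krasny replaces this with a
// Taylor-expansion-based fast evaluation.
std::array<double, 3> velocity_at(const std::array<double, 3>& x,
                                  const std::vector<std::array<double, 3>>& y,  // source positions
                                  const std::vector<std::array<double, 3>>& W,  // source weights
                                  double sigma) {
    const double pi = 3.14159265358979323846;
    std::array<double, 3> u{0.0, 0.0, 0.0};
    for (std::size_t j = 0; j < y.size(); ++j) {
        const double dx = x[0] - y[j][0], dy = x[1] - y[j][1], dz = x[2] - y[j][2];
        const double r2 = dx * dx + dy * dy + dz * dz + sigma * sigma;
        const double denom = 4.0 * pi * r2 * std::sqrt(r2);   // 4*pi*(|x-y|^2 + sigma^2)^(3/2)
        // u += K_RM(x, y_j) x W_j, which equals W_j x (x - y_j) / denom.
        u[0] += (W[j][1] * dz - W[j][2] * dy) / denom;
        u[1] += (W[j][2] * dx - W[j][0] * dz) / denom;
        u[2] += (W[j][0] * dy - W[j][1] * dx) / denom;
    }
    return u;
}
```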

  6. Clustering of particles for optimal load balancing Distribution of particles: 2,000,000 particles over 256 processors/clusters. A k-means clustering algorithm is used.
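The slides do not show the clustering code itself, so the sketch below is a generic Lloyd-style k-means pass that produces the particle-to-cluster membership used for load balancing. The function signature, the fixed iteration count, and the absence of any per-cluster load weighting are assumptions.

```cpp
#include <array>
#include <cstddef>
#include <limits>
#include <vector>

// Assign each particle to the nearest cluster centre, then move each centre
// to the mean of its particles. One cluster per processor gives the
// particle-to-processor mapping described on the slide.
void kmeans(const std::vector<std::array<double, 3>>& pos,
            std::vector<std::array<double, 3>>& centres,   // initialized by the caller
            std::vector<int>& membership,                  // output: cluster id per particle
            int iterations) {
    const std::size_t n = pos.size(), k = centres.size();
    membership.assign(n, 0);
    for (int it = 0; it < iterations; ++it) {
        // Assignment step: nearest centre in squared Euclidean distance.
        for (std::size_t i = 0; i < n; ++i) {
            double best = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                double d = 0.0;
                for (int a = 0; a < 3; ++a) {
                    const double diff = pos[i][a] - centres[c][a];
                    d += diff * diff;
                }
                if (d < best) { best = d; membership[i] = static_cast<int>(c); }
            }
        }
        // Update step: recompute each centre as the mean of its members.
        std::vector<std::array<double, 3>> sum(k);
        std::vector<std::size_t> count(k, 0);
        for (std::size_t i = 0; i < n; ++i) {
            for (int a = 0; a < 3; ++a) sum[membership[i]][a] += pos[i][a];
            ++count[membership[i]];
        }
        for (std::size_t c = 0; c < k; ++c)
            if (count[c] > 0)
                for (int a = 0; a < 3; ++a) centres[c][a] = sum[c][a] / count[c];
    }
}
```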

  7. Copy vs. Ring Algorithm [Figure: processors P1-P4, each holding copies of the memory blocks M1-M4.] Copy algorithm: each processor keeps a copy of all the particles involved, then the processors communicate their results.
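One way to realize the copy idea with MPI is to replicate every rank's packed source particles on all ranks before the local summation, e.g. with MPI_Allgatherv. The sketch below is illustrative (the function name and the packing of each particle as 6 doubles are assumptions); it also makes the memory limitation of the copy algorithm obvious, since every rank ends up storing all N particles.

```cpp
#include <mpi.h>
#include <vector>

// Gather a full copy of all source particles on every rank ("copy" strategy).
// Each particle is assumed packed as 6 doubles (position + weight).
std::vector<double> gather_all_sources(const std::vector<double>& local_packed,
                                       MPI_Comm comm) {
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    // Exchange how many doubles each rank contributes.
    const int local_count = static_cast<int>(local_packed.size());
    std::vector<int> counts(nprocs), displs(nprocs);
    MPI_Allgather(&local_count, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);

    int total = 0;
    for (int p = 0; p < nprocs; ++p) { displs[p] = total; total += counts[p]; }

    // Every rank holds all particles afterwards: simple, but memory-limited.
    std::vector<double> all_packed(total);
    MPI_Allgatherv(local_packed.data(), local_count, MPI_DOUBLE,
                   all_packed.data(), counts.data(), displs.data(), MPI_DOUBLE, comm);
    return all_packed;
}
```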

  8. Copy vs. Ring Algorithm [Figure: processors P1-P4 arranged in a ring, passing source particles around.] Ring algorithm: each processor keeps track of its own particles and communicates the sources with the others whenever needed.

  9. Copy vs. Ring Algorithm [Figure: a later step of the ring exchange, with the source clusters of P1-P4 shifted to the next processor in the ring.]
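A minimal MPI sketch of the ring exchange illustrated above follows: each rank keeps only its own targets, and a buffer of source particles is shifted around the ring with MPI_Sendrecv_replace, one interaction pass per shift. The interact() helper, the fixed-size padded buffer, and the packing of particles as raw doubles are assumptions; a real implementation would also exchange the actual particle counts.

```cpp
#include <mpi.h>
#include <vector>

// Assumed hook (declared only): adds the velocity contribution of the current
// source chunk to each local target; zero-weight padding contributes nothing.
void interact(std::vector<double>& velocities,
              const std::vector<double>& targets,
              const std::vector<double>& sources);

void ring_velocities(std::vector<double>& velocities,          // per-target result
                     const std::vector<double>& targets,       // local target particles
                     std::vector<double> sources,              // starts as the local sources
                     int max_source_doubles,                   // buffer capacity (assumed fixed)
                     MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    const int next = (rank + 1) % nprocs;
    const int prev = (rank + nprocs - 1) % nprocs;

    // Pad with zeros so every shift moves a fixed-size buffer.
    sources.resize(max_source_doubles);

    for (int shift = 0; shift < nprocs; ++shift) {
        interact(velocities, targets, sources);   // use the chunk currently resident
        if (shift + 1 < nprocs)                   // pass the chunk one step around the ring
            MPI_Sendrecv_replace(sources.data(), max_source_doubles, MPI_DOUBLE,
                                 next, 0, prev, 0, comm, MPI_STATUS_IGNORE);
    }
}
```

After nprocs - 1 shifts, every rank has interacted with every source chunk exactly once, without ever holding more than one remote chunk in memory at a time.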

  10. Parallel Performance (Watson-Shaheen) CPU time for advection vs. number of processors, 2,000,000 particles (strong scaling). [Plot: CPU time (s) against 64, 256, and 1,024 processors, comparing the ring and copy algorithms.]

  11. Parallel Performance (Watson-Shaheen) Parallel efficiency, 2,000,000 particles (normalized strong scaling). [Plot: parallel efficiency against 64, 256, and 1,024 processors, comparing the ring and copy algorithms.]

  12. Comparison
     Ring algorithm:
     • Allows for a bigger number of computational points.
     • Too much data circulation: its efficiency decreases for high numbers of processors.
     Copy algorithm:
     • Less communication: recommended for high numbers of processors.
     • Memory limited.

  13. Parallel implementation using the ring algorithm Resolution of the N-body problem • Parallel implementation using the ring algorithm (pure MPI). • Performed a simulation with 60 million particles on 1 node of Pharos (8 processors, 16 GB). • Expected results on Shaheen: 1.8 million particles per processor, i.e., about 1.8 billion particles on 1,024 processors. • New implementation of the clustering algorithm.

  14. Particle redistribution New implementation of the clustering algorithm • Assign a new membership to each particle. • Heap sort of the particles within each processor according to their membership. • Redistribute all the particles to their respective processors using MPI, for better locality. [Figure: particle clusters P1-P4 before redistribution.]

  15. Particle redistribution New implementation of the clustering algorithm • Assign a new membership to each particle. • Heap sort of the particles within each processor according to their membership. • Redistribute all the particles to their respective processors using MPI, for better locality. [Figure: an intermediate step of the redistribution among P1-P4.]

  16. Particle redistribution New implementation of the clustering algorithm • Assign a new membership to each particle. • Heap sort of the particles within each processor according to their membership. • Redistribute all the particles to their respective processors using MPI, for better locality.
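A hedged sketch of the redistribution step described on the three slides above: particles are packed contiguously, ordered by destination rank, and exchanged with MPI_Alltoall/MPI_Alltoallv so that each rank ends up owning exactly the particles of its cluster. The packing layout and the use of std::sort in place of the heap sort mentioned above are assumptions.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

// Redistribute particles (packed as `doubles_per_particle` doubles each) so
// that each particle ends up on the rank given by its membership.
std::vector<double> redistribute(const std::vector<double>& packed,
                                 const std::vector<int>& membership,   // destination rank per particle
                                 int doubles_per_particle,
                                 MPI_Comm comm) {
    int nprocs;
    MPI_Comm_size(comm, &nprocs);
    const int n = static_cast<int>(membership.size());

    // Order particle indices by destination rank, then pack in that order.
    std::vector<int> order(n);
    for (int i = 0; i < n; ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return membership[a] < membership[b]; });

    std::vector<double> send(packed.size());
    std::vector<int> sendcounts(nprocs, 0);
    for (int k = 0; k < n; ++k) {
        const int i = order[k];
        std::copy_n(&packed[i * doubles_per_particle], doubles_per_particle,
                    &send[k * doubles_per_particle]);
        sendcounts[membership[i]] += doubles_per_particle;
    }

    // Each rank learns how much it will receive from every other rank.
    std::vector<int> recvcounts(nprocs), sdispls(nprocs, 0), rdispls(nprocs, 0);
    MPI_Alltoall(sendcounts.data(), 1, MPI_INT, recvcounts.data(), 1, MPI_INT, comm);
    for (int p = 1; p < nprocs; ++p) {
        sdispls[p] = sdispls[p - 1] + sendcounts[p - 1];
        rdispls[p] = rdispls[p - 1] + recvcounts[p - 1];
    }

    std::vector<double> recv(rdispls[nprocs - 1] + recvcounts[nprocs - 1]);
    MPI_Alltoallv(send.data(), sendcounts.data(), sdispls.data(), MPI_DOUBLE,
                  recv.data(), recvcounts.data(), rdispls.data(), MPI_DOUBLE, comm);
    return recv;   // the particles now owned by this rank, for better locality
}
```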

  17. Transverse jet: Motivations Wide range of applications: • Combustion: industrial burners, aircraft engines. • Pipe tee mixers. • V/STOL aerodynamics. • Pollutant dispersion (chimney plumes, effluent discharge). [Figure: gas-turbine combustor schematic showing the fuel nozzle, primary air jets, dilution air jets, primary and secondary combustion zones, and turbine inlet guide vanes (Boeing Military Airplane Company, U.S. Department of Defense); photograph by Otto Perry (1894-1970), Western History Department of the Denver Public Library.] Mixing in combustors for gas turbines: higher thrust, better operability, lower NOx, CO, UHC.

  18. Numerical Results r = 10, Re = 245, Re_j = 2450. Vorticity isosurfaces |ω| = 3.5. The ring algorithm allows for large simulations (> 5 million points) that could not be run with the copy algorithm.

  19. Future Tasks Towards more parallel efficiency… • Find hybrid strategies between the copy and the ring algorithms by splitting the particles in a way that maximizes the use of local memory, rather than splitting them by the number of processors while the memory limit has not yet been reached. This will reduce the number of shifts in the ring algorithm and increase its efficiency for large numbers of processors. • Another alternative: use mixed OpenMP/MPI programming (see next slide). This will reduce the number of shifts by a factor of the number of processors per node (8 in our case).

  20. Mixed MPI-OpenMP implementation Another alternative would be to use MPI across nodes and OpenMP locally on each node (shared memory), with the ring algorithm. Pros: - Easy implementation, few modifications needed. - Built-in load-balancing subroutines. - The fast summation is more time-efficient on a bigger set of particles. - Will reduce the communication time: the number of travelling clusters in the ring algorithm will be reduced by a factor of the number of processors per node (8 in our case).
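To illustrate the hybrid idea, the sketch below shows only the node-local part: OpenMP threads share the node's targets against the currently resident source chunk, while the ring shift itself stays in MPI (as in the pure-MPI sketch earlier) and happens once per node rather than once per core, which is where the communication saving comes from. The interact_one() helper and the packing of particles are assumptions.

```cpp
#include <mpi.h>
#include <omp.h>
#include <vector>

// Assumed hook (declared only): velocity of one target against the current
// source buffer, e.g. a direct or treecode summation.
void interact_one(double* velocity, const double* target,
                  const std::vector<double>& sources);

// Node-local interaction pass: one MPI rank per node owns `targets`, and the
// OpenMP threads of that node split them. The MPI ring shift of `sources`
// happens outside this routine, once per node.
void hybrid_ring_step(std::vector<double>& velocities,     // 3 doubles per target
                      const std::vector<double>& targets,  // 6 doubles per target (position + weight)
                      const std::vector<double>& sources) {
    const int ntargets = static_cast<int>(velocities.size()) / 3;

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < ntargets; ++i)
        interact_one(&velocities[3 * i], &targets[6 * i], sources);
}
```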
