3D Particle Methods Code Parallelization Fabrice Schlegel
Introduction
Goal: efficient parallelization and memory optimization of a CFD code used for the direct numerical simulation (DNS) of turbulent combustion.
Hardware:
1) Our lab cluster, Pharos, consisting of 60 Intel Xeon Harpertown nodes, each with dual quad-core CPUs (8 processors) at 2.66 GHz.
2) Shaheen, a 16-rack IBM Blue Gene/P system owned by KAUST. Four 850 MHz processors are integrated on each Blue Gene/P chip, and a standard Blue Gene/P rack houses 4,096 processors.
Parallel library: MPI
Outline
Brief overview of Lagrangian vortex methods
Comparison of the ring and copy parallelization algorithms – numerical results in terms of speed and parallel efficiency
Previous code modifications to account for the new data structure
Applications of the ring algorithm to large transverse jet simulations
Vortex Element Methods
Vortex simulations solve for the vorticity $\omega$ instead of the velocity $u$ (more compact support):
$$\frac{\partial \omega}{\partial t} + (u \cdot \nabla)\omega = (\omega \cdot \nabla)u + \frac{1}{Re}\nabla^2 \omega + \frac{Gr}{Re^2}\, \nabla\rho \times g$$
The vorticity field is discretized over $N$ elements:
$$\omega(x,t) \approx \sum_{i=1}^{N} W_i(t)\, f_\sigma\!\left(x - \chi_i(t)\right)$$
which evolve according to
$$\frac{d\chi_i}{dt} = u(\chi_i, t), \qquad \frac{dW_i}{dt} = \left(W_i \cdot \nabla\right) u(\chi_i, t)$$
An element is described by a discrete node point: positions $\{\chi_i\}$ and weights $\{W_i\}$, $i = 1, \dots, N$.
Efficient utilization of computational elements.
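The element evolution equations above can be sketched with a forward-Euler step. The solid-body-rotation velocity field below is a hypothetical stand-in (in the actual code, $u$ comes from the Biot-Savart summation over all elements); the update of positions $\chi_i$ and weights $W_i$ follows the two ODEs on this slide.

```python
import numpy as np

def velocity(x):
    """Hypothetical velocity field u = (-y, x, 0), used only to exercise
    the update; the real u comes from the fast summation over elements."""
    u = np.zeros_like(x)
    u[:, 0] = -x[:, 1]
    u[:, 1] = x[:, 0]
    return u

def grad_velocity(x):
    """Velocity gradient g[n, i, j] = du_i/dx_j of the field above."""
    g = np.zeros((x.shape[0], 3, 3))
    g[:, 0, 1] = -1.0   # du_x / dy
    g[:, 1, 0] = 1.0    # du_y / dx
    return g

def euler_step(chi, W, dt):
    """One forward-Euler step of d(chi)/dt = u(chi), dW/dt = (W . grad) u."""
    g = grad_velocity(chi)
    chi_new = chi + dt * velocity(chi)
    # stretching term: (W . grad)u, component i = sum_j W_j du_i/dx_j
    W_new = W + dt * np.einsum('nj,nij->ni', W, g)
    return chi_new, W_new
```

In the production code this update would be a higher-order time integrator; the Euler step is only meant to show how positions and weights advance together.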
Fast summation for the velocity evaluation
Serial treecode [Lindsay and Krasny 2001]
– Velocity from the Rosenhead-Moore kernel:
$$u(x_i) = \sum_{j=1}^{N} K_{RM}(x_i, y_j) \times W_j, \qquad K_{RM}(x, y) = -\frac{1}{4\pi}\,\frac{x - y}{\left(|x - y|^2 + \sigma^2\right)^{3/2}}$$
– Constructs an adaptive oct-tree and uses a Taylor expansion of the kernel in Cartesian coordinates about each cell center $y_c$:
$$u(x_{target}) \approx \sum_{p} \frac{1}{p!}\, D_y^p K(x_{target}, y_c) \times \sum_{i=1}^{N} W_i\, (y_i - y_c)^p$$
Taylor coefficients are computed with a recurrence relation; cell moments are stored for re-use.
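As a reference point for the treecode, a direct $O(NM)$ evaluation of the Rosenhead-Moore sum can be sketched as follows (the treecode replaces this loop with the Taylor-expanded far-field approximation, and is typically validated against exactly this kind of direct sum):

```python
import numpy as np

def velocity_rm(targets, sources, weights, sigma=0.1):
    """Direct evaluation of u(x) = sum_j K_RM(x, y_j) x W_j with the
    Rosenhead-Moore smoothed kernel
        K_RM(x, y) = -(1/4pi) (x - y) / (|x - y|^2 + sigma^2)^(3/2).
    """
    u = np.zeros_like(targets)
    for y, w in zip(sources, weights):
        d = targets - y                              # x - y_j for all targets
        r2 = np.sum(d * d, axis=1) + sigma**2        # smoothed distance^2
        k = -d / (4.0 * np.pi * r2[:, None]**1.5)    # kernel vector K_RM
        u += np.cross(k, w)                          # K_RM x W_j
    return u
```

A single element with weight along $+z$ induces a counter-clockwise swirl in the $xy$-plane, which is a quick sanity check on the sign convention.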
Clustering of particles for optimal load balancing
Distribution of 2,000,000 particles over 256 processors/clusters.
A k-means clustering algorithm is used.
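A minimal Lloyd-style k-means sketch of the clustering step is shown below (the production version would additionally balance cluster sizes across processors, which plain k-means does not guarantee):

```python
import numpy as np

def kmeans_clusters(x, k, iters=20, seed=0):
    """Assign each particle to the nearest centroid, then move each
    centroid to the mean of its members; repeat for a fixed number of
    iterations. Returns (membership, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # squared distance from every particle to every centroid
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        membership = d2.argmin(axis=1)
        for c in range(k):
            pts = x[membership == c]
            if len(pts):                     # skip empty clusters
                centroids[c] = pts.mean(axis=0)
    return membership, centroids
```

Each cluster then becomes one processor's set of particles, keeping spatially nearby particles on the same rank.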
Copy vs. Ring Algorithm
Copy algorithm: each processor keeps a copy of all the particles involved, then the processors communicate their results.
Copy vs. Ring Algorithm
Ring algorithm: each processor keeps track of its own particles and communicates the sources with the others whenever needed.
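The ring exchange can be sketched serially: each "processor" holds one block of particles, the source blocks rotate one hop per step, and after $P-1$ shifts (plus the local pass) every target block has interacted with every source block. In the MPI code each shift would be a send/receive with the ring neighbour (e.g. `MPI_Sendrecv`).

```python
import numpy as np

def ring_all_interactions(blocks, interact):
    """Serial sketch of the ring algorithm. blocks[p] holds processor p's
    particles; interact(targets, sources) returns the partial result of
    the targets interacting with one visiting source block."""
    P = len(blocks)
    acc = [interact(blocks[p], blocks[p]) for p in range(P)]  # local pass
    visiting = list(blocks)
    for _ in range(P - 1):
        visiting = visiting[1:] + visiting[:1]   # shift sources one hop
        for p in range(P):
            acc[p] = acc[p] + interact(blocks[p], visiting[p])
    return acc
```

With a simple additive interaction (summing the visiting sources), every block should end up seeing the global total, which is the defining property of the ring.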
Parallel Performance (Watson-Shaheen)
[Figure: CPU time for advection vs. number of processors (strong scaling, 2,000,000 particles), comparing the ring and copy algorithms from 64 to 1024 processors.]
Parallel Performance (Watson-Shaheen)
[Figure: parallel efficiency vs. number of processors (normalized strong scaling, 2,000,000 particles) for the ring and copy algorithms, from 64 to 1024 processors.]
Comparison
Ring algorithm:
• Allows for a bigger number of computational points.
• Too much data circulation: its efficiency decreases for high numbers of processors.
Copy algorithm:
• Less communication: recommended for high numbers of processors.
• Memory limited.
Parallel implementation using the ring algorithm
Resolution of the N-body problem
• Parallel implementation using the ring algorithm (pure MPI).
• Performed a simulation with 60 million particles on 1 node of Pharos (8 processors, 16 GB).
• Expected results on Shaheen: 1.8 million particles per processor, i.e., 1.8 billion particles on 1024 processors.
New implementation of the clustering algorithm
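A back-of-the-envelope check of these figures: the reported 60 million particles on one 16 GB node imply a per-particle memory footprint, from which the Shaheen projection follows. (The footprint is inferred from the slide's numbers, not from the actual data structures.)

```python
# Implied memory footprint per particle from the reported figures:
# 60 million particles fit on one 16 GB Pharos node.
node_memory_bytes = 16 * 2**30
particles_per_node = 60_000_000
bytes_per_particle = node_memory_bytes / particles_per_node
print(round(bytes_per_particle))      # 286 bytes per particle

# At 1.8 million particles per processor on 1024 Shaheen processors:
total_particles = 1024 * 1_800_000
print(total_particles)                # 1843200000, i.e. ~1.8 billion
```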
Particle redistribution
New implementation of the clustering algorithm:
• Assign a new membership to each particle.
• Heap-sort the particles within each processor according to their membership.
• Redistribute all the particles to their respective processors using MPI, for better locality.
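The three redistribution steps above can be sketched as follows. NumPy's `argsort` with `kind='heapsort'` stands in for the code's heap sort; in the MPI version the contiguous per-processor slices would then be exchanged (e.g. with `MPI_Alltoallv`).

```python
import numpy as np

def redistribute(particles, membership, P):
    """Sort particles by their new cluster membership (heap sort), then
    slice out the contiguous block destined for each of the P processors."""
    order = np.argsort(membership, kind='heapsort')
    particles = particles[order]
    membership = membership[order]
    return [particles[membership == p] for p in range(P)]
```

After this step each processor owns one spatially coherent cluster, which is what gives the ring algorithm its locality.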
Transverse jet: Motivations
Wide range of applications:
• Combustion: industrial burners, aircraft engines.
• Pipe tee mixers.
• V/STOL aerodynamics.
• Pollutant dispersion (chimney plumes, effluent discharge).
Mixing in combustors for gas turbines (primary and secondary combustion zones, dilution and primary air jets, fuel nozzle, turbine inlet guide vanes): higher thrust, better operability, lower NOx, CO, and UHC.
(Figure credits: Boeing Military Airplane Company / U.S. Department of Defense; photograph by Otto Perry (1894-1970), Western History Department of the Denver Public Library.)
Numerical Results
r = 10, Re = 245, Re_j = 2450. Vorticity isosurfaces |ω| = 3.5.
The ring algorithm allows for large simulations (> 5 million points) that could not be run with the copy algorithm.
Future Tasks
Towards more parallel efficiency...
• Find hybrid strategies between the copy and the ring algorithms, by splitting the particles so as to maximize the use of local memory, rather than splitting them by the number of processors when the memory limit has not been reached yet. This will reduce the number of shifts in the ring algorithm and increase its efficiency for large numbers of processors.
• Another alternative: use mixed OpenMP/MPI programming (see next slide). This will reduce the number of shifts by the number of processors per node (8 in our case).
Mixed MPI-OpenMP implementation
Another alternative would be to use MPI across nodes (distributed memory) and OpenMP locally on each node (shared memory), together with the ring algorithm.
Pros:
- Easy implementation, few modifications.
- Built-in load balancing subroutines.
- The fast summation will be more time-efficient on a bigger set of particles.
- Will reduce the communication time: the number of travelling clusters in the ring algorithm is reduced by the number of processors per node (8 in our case)!
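The shift-count argument can be made concrete: pure MPI uses one rank per core, so a full ring pass costs $P-1$ shifts, while the hybrid scheme uses one rank per node and divides the ring length by the cores per node.

```python
def ring_shift_counts(total_cores, cores_per_node):
    """Number of ring shifts per velocity evaluation:
    pure MPI = one rank per core -> (P - 1) shifts;
    hybrid MPI/OpenMP = one rank per node -> (P / cores_per_node - 1)."""
    pure_mpi = total_cores - 1
    hybrid = total_cores // cores_per_node - 1
    return pure_mpi, hybrid

# e.g. 1024 cores with 8 cores per node:
# pure MPI needs 1023 shifts, the hybrid scheme only 127.
```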