
Large Scale Dark Matter Simulation - PowerPoint PPT Presentation



  1. Large Scale Dark Matter Simulation. Tomoaki Ishiyama, University of Tsukuba (HPCI Strategic Field Program 5)

  2. Standard cosmology: the cold dark matter (CDM) model
  ● Dark matter dominates gravitational structure formation in the Universe: there is five times as much of it as baryonic matter
  ● Structure grows from primordial small density fluctuations
  ● Smaller dark matter structures gravitationally collapse first, then merge into larger structures (dark matter halos)
  ● Baryonic matter condenses in the potential wells of dark matter halos and forms galaxies
  (Figure: large scale structure, dark matter halo, galaxy)

  3. Large scale dark matter simulation
  ● High resolution observations
  ● Wide field
  -> large, high resolution simulations: 4096^3 - 8192^3 particles in 800-1600 Mpc boxes
  ● Preparation for other large scale simulations
    ● Galaxy formation
    ● Star formation

  4. Parallel scaling
  ● Parallel calculation is not trivial
    ● communication overhead
  ● Gravity is a long-range force -> long communication
  ● Cosmological N-body simulation codes can run efficiently on systems with a few hundred nodes. But...
  (Figure: cluster schematic with nodes (CPU + memory) and a fast network; scaling of Gadget-2, Springel 2005)

  5. PC cluster (~100 cores), CfCA Cray-XT4 (~700 nodes, ~3000 cores), CfCA Cray-XC30 (~1060 nodes, ~25440 cores), K computer (~80000 nodes, ~0.66M cores)
  How can we get good efficiency on ~1M core systems?
  Here we use the K computer, the world's third fastest / most power consuming system.

  6. Well known N-body algorithms
  ● Tree (Barnes and Hut, 1986)
    ● Solves the forces from distant particles by multipole expansions
    ● O(N log N)
  ● Particle-Mesh (PM: e.g. Hockney and Eastwood, 1981)
    ● Calculates the mass density on a coarse-grained mesh
    ● Solves the Poisson equation by FFT (ρ ➙ Φ; see the worked form below)
    ● Advantage: very fast, and the periodic boundary is naturally satisfied
    ● Disadvantage: resolution is restricted by the mesh size
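  For reference, the PM Poisson step written out per Fourier mode under the standard Green's function; mesh assignment and deconvolution factors are omitted here.

```latex
% Poisson equation on the mesh, solved per Fourier mode after the FFT.
\nabla^{2}\Phi = 4\pi G\rho
\quad\xrightarrow{\ \mathrm{FFT}\ }\quad
-\,k^{2}\,\hat{\Phi}(\mathbf{k}) = 4\pi G\,\hat{\rho}(\mathbf{k})
\quad\Longrightarrow\quad
\hat{\Phi}(\mathbf{k}) = -\,\frac{4\pi G\,\hat{\rho}(\mathbf{k})}{k^{2}},
\qquad \hat{\Phi}(\mathbf{0}) = 0 .
```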

  7. TreePM N-body solver
  ● Tree for short-range forces
  ● PM (particle mesh) for long-range forces
  ● O(N^2) reduced to O(N log N)
  ● Suitable for periodic boundaries
  ● Has become popular since the 21st century:
    ● GOTPM (Dubinski+ 2004)
    ● GADGET-2 (Springel 2005)
    ● GreeM (Ishiyama+ 2009)
    ● HACC (Habib+ 2012)
  (A sketch of the short-range side of the force split follows this slide.)
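  A minimal scalar sketch of the short-range (Tree) side of the TreePM split, assuming the classic Gaussian/erfc splitting used e.g. by GADGET-2; GreeM itself uses a different cutoff function, and the names rs and rcut are illustrative.

```c
/* Short-range pairwise force under the erfc force splitting.
 * rs   : splitting scale tied to the PM mesh
 * rcut : radius beyond which the short-range force is treated as zero */
#include <math.h>

static void short_range_force(const double xi[3], const double xj[3],
                              double mj, double rs, double rcut,
                              double acc[3])
{
    double dx[3], r2 = 0.0;
    for (int k = 0; k < 3; k++) {
        dx[k] = xj[k] - xi[k];
        r2 += dx[k] * dx[k];
    }
    if (r2 == 0.0 || r2 > rcut * rcut) return;   /* outside cutoff */

    double r = sqrt(r2);
    double u = r / (2.0 * rs);
    /* 1/r^2 force times a smooth truncation factor */
    double g = erfc(u) + u * (2.0 / sqrt(M_PI)) * exp(-u * u);
    double f = mj * g / (r2 * r);                /* G = 1 units */

    for (int k = 0; k < 3; k++) acc[k] += f * dx[k];
}
```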

  8. Parallel calculation procedure
  1. 3D domain decomposition, redistribute particles (short and long communication)
  2. Calculate long-range forces (PM) (long communication)
  3. Calculate short-range forces (Tree) (short communication)
  4. Time integration (second order leapfrog)
  5. Return to 1
  (A sketch of this main loop follows.)
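  A minimal sketch of this loop with a kick-drift-kick (KDK) second-order leapfrog, ignoring cosmological expansion factors; decompose_domain() and compute_forces() are placeholders for steps 1-3, not the actual GreeM routines.

```c
/* One force evaluation per step; accelerations are stored in p[i].a. */
typedef struct { double x[3], v[3], a[3], m; } Particle;

void decompose_domain(Particle **p, int *n_local);   /* placeholder: step 1    */
void compute_forces(Particle *p, int n_local);       /* placeholder: steps 2+3 */

static void kick(Particle *p, int n, double dt)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < 3; k++) p[i].v[k] += dt * p[i].a[k];
}

static void drift(Particle *p, int n, double dt)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < 3; k++) p[i].x[k] += dt * p[i].v[k];
}

void run(Particle *p, int n_local, double dt, int nsteps)
{
    decompose_domain(&p, &n_local);
    compute_forces(p, n_local);            /* accelerations for the first kick */

    for (int step = 0; step < nsteps; step++) {
        kick(p, n_local, 0.5 * dt);        /* 4. half kick                     */
        drift(p, n_local, dt);             /* 4. drift                         */

        decompose_domain(&p, &n_local);    /* 1. redistribute particles        */
        compute_forces(p, n_local);        /* 2+3. PM (long) + Tree (short)    */

        kick(p, n_local, 0.5 * dt);        /* 4. closing half kick             */
    }                                      /* 5. return to 1                   */
}
```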

  9. Technical issues
  ● Dense structures form everywhere via the gravitational instability
    -> Dynamic domain update with a good load balancer
  ● Gravity is a long-range force
    -> Hierarchical communication
  ● O(N log N) is still expensive
    -> SIMD (GRAPEs, GPU)

  10. Optimization 1: Dynamic domain update
  ● Domain geometry
    ● ∆ Space filling curve (GADGET-2)
    ● ○ Multisection (our code): enables each node to know easily where to perform short communications
  ● Load balance criterion
    ● ⅹ equal #particles
    ● ∆ equal #interactions
    ● ○ equal calculation time (our code)
  (A sketch of owner lookup under multisection follows.)
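  A minimal sketch of why multisection makes the lookup easy: the geometry is fully described by three small arrays of split coordinates, so any node finds the owner of a position by three binary searches. The array names and the row-major rank layout below are assumptions, not the actual GreeM data structures.

```c
/* Assumes pos lies inside the box, i.e. within [s[0], s[n]) on each axis. */
static int find_section(const double *s, int n, double v)
{
    int lo = 0, hi = n;                 /* s[0..n] are n+1 section boundaries */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (v < s[mid + 1]) hi = mid; else lo = mid + 1;
    }
    return lo;                          /* index of the containing section */
}

/* nx*ny*nz ranks; *split arrays hold n+1 boundaries per axis. */
int owner_rank(const double pos[3],
               const double *xsplit, int nx,
               const double *ysplit, int ny,
               const double *zsplit, int nz)
{
    int ix = find_section(xsplit, nx, pos[0]);
    int iy = find_section(ysplit, ny, pos[1]);
    int iz = find_section(zsplit, nz, pos[2]);
    return (ix * ny + iy) * nz + iz;    /* assumed row-major rank layout */
}
```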

  11. Load balancer
  ● Sampling method to determine the geometry
  ● Each node samples particles and shares them with all nodes (sampling rate R ~ 10^-3 - 10^-5)
  ● The sampling frequency depends on the local calculation cost
    -> realizes a near-ideal load balance
  ● The new geometry is adjusted so that all domains have the same number of sampled particles
  ● Linear weighted moving average over the last 5 steps
  (Flow: calculation cost -> #sampled particles -> new geometry -> new calculation cost; a sketch follows.)
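  A minimal sketch of the sampling step, assuming samples are drawn at a rate proportional to the measured local cost and shared with everyone via MPI_Allgatherv; the names and the base rate are illustrative, not the GreeM implementation.

```c
#include <mpi.h>
#include <stdlib.h>

typedef struct { double x[3]; } Sample;

int gather_samples(const double (*pos)[3], int n_local,
                   double local_cost, double mean_cost,
                   double base_rate,              /* e.g. ~1e-4 */
                   Sample **all, MPI_Comm comm)
{
    int nprocs;  MPI_Comm_size(comm, &nprocs);

    /* Sampling rate scaled by relative cost: expensive ranks send more
     * samples, so equal-sample domains end up with equal cost. */
    double rate = base_rate * local_cost / mean_cost;
    int n_s = 0;
    Sample *mine = malloc((size_t)n_local * sizeof(Sample));
    for (int i = 0; i < n_local; i++)
        if (rand() / (double)RAND_MAX < rate) {
            for (int k = 0; k < 3; k++) mine[n_s].x[k] = pos[i][k];
            n_s++;
        }

    /* Share sample counts, then the samples themselves, with all ranks. */
    int *counts = malloc(nprocs * sizeof(int));
    int *displs = malloc(nprocs * sizeof(int));
    MPI_Allgather(&n_s, 1, MPI_INT, counts, 1, MPI_INT, comm);
    int total = 0;
    for (int r = 0; r < nprocs; r++) { displs[r] = total; total += counts[r]; }

    *all = malloc((size_t)total * sizeof(Sample));
    for (int r = 0; r < nprocs; r++) { counts[r] *= 3; displs[r] *= 3; }
    MPI_Allgatherv(mine, 3 * n_s, MPI_DOUBLE,
                   *all, counts, displs, MPI_DOUBLE, comm);

    free(mine); free(counts); free(displs);
    return total;   /* new boundaries = coordinate quantiles of these samples */
}
```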

  12. Optimization 2: Parallel PM
  1. Density assignment
  2. Local density -> slab density (long communication)
  3. FFT (density -> potential); long communication inside the FFT (FFTW, Fujitsu FFT)
  4. Slab potential -> local potential (long communication)
  5. Force interpolation
  (A slab-FFT sketch follows.)
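  A minimal sketch of the slab-decomposed FFT step using FFTW's MPI interface (the production code can also use Fujitsu's FFT); only the forward transform is shown, and the mesh size NG is an assumed parameter.

```c
#include <fftw3-mpi.h>

void pm_fft_forward(ptrdiff_t NG, MPI_Comm comm)
{
    fftw_mpi_init();

    /* FFTW assigns each rank a slab of local_n0 planes along the first axis. */
    ptrdiff_t local_n0, local_0_start;
    ptrdiff_t alloc = fftw_mpi_local_size_3d(NG, NG, NG / 2 + 1, comm,
                                             &local_n0, &local_0_start);

    double       *rho     = fftw_alloc_real(2 * alloc);   /* padded real slab */
    fftw_complex *rho_hat = fftw_alloc_complex(alloc);

    fftw_plan plan = fftw_mpi_plan_dft_r2c_3d(NG, NG, NG, rho, rho_hat,
                                              comm, FFTW_ESTIMATE);

    /* ... fill rho with the density received from the 3D-decomposed ranks
     *     (the "local density -> slab density" long communication) ... */

    fftw_execute(plan);   /* internal all-to-all transpose = long communication */

    /* ... multiply rho_hat by the Green's function -4*pi*G/k^2,
     *     inverse-transform, and scatter the slab potential back ... */

    fftw_destroy_plan(plan);
    fftw_free(rho); fftw_free(rho_hat);
}
```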

  13. Optimization 2: Numerical challenge
  ● If we simulate ~10000^3 particles with a ~5000^3 PM mesh on the full system of the K computer...
  ● The FFT is performed by only ~5000 nodes (FFTW), since only the slab mesh layout is supported
  ● Particles are distributed over a 3D domain (48x54x32 = 82944 nodes)
  ● Each FFT process has to receive density meshes from ~5000 nodes
    ● MPI_Isend / MPI_Irecv would not work
    ● MPI_Alltoallv is easy to implement, but...

  14. Optimization 2: Novel communication algorithm
  ● Hierarchical communication
  ● MPI_Alltoallv(..., MPI_COMM_WORLD)
    -> MPI_Alltoallv(..., COMM_SMALLA2A) + MPI_Reduce(..., COMM_REDUCE)
  (A sketch of the two-level scheme follows.)
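  A sketch of how the two communicators named above could be built and used. The concrete grouping of ranks in GreeM may differ, so the split keys and the ordering (small all-to-all within a group, then a sum across groups onto the FFT ranks) are assumptions for illustration only.

```c
#include <mpi.h>

void hierarchical_density_exchange(const double *sendbuf, const int *scounts,
                                   const int *sdispls,
                                   double *groupbuf, const int *rcounts,
                                   const int *rdispls, int nrecv,
                                   double *slabbuf,
                                   int group_id, int member_id)
{
    MPI_Comm comm_smalla2a, comm_reduce;
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Ranks with the same group_id form one small all-to-all group;
     * ranks with the same member_id across groups form one reduce group. */
    MPI_Comm_split(MPI_COMM_WORLD, group_id,  world_rank, &comm_smalla2a);
    MPI_Comm_split(MPI_COMM_WORLD, member_id, world_rank, &comm_reduce);

    /* Step 1: small all-to-all inside the group instead of a global one. */
    MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_DOUBLE,
                  groupbuf, rcounts, rdispls, MPI_DOUBLE, comm_smalla2a);

    /* Step 2: sum the partial density meshes across groups onto rank 0
     * of each reduce communicator (an FFT process). */
    MPI_Reduce(groupbuf, slabbuf, nrecv, MPI_DOUBLE, MPI_SUM, 0, comm_reduce);

    MPI_Comm_free(&comm_smalla2a);
    MPI_Comm_free(&comm_reduce);
}
```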

  15. Optimization 3: Gravity kernel
  This implementation was done by Keigo Nitadori as an extension of the Phantom-GRAPE library (Nitadori+ 2006, Tanikawa+ 2012, 2013)
  1. One if-branch is eliminated
  2. Optimized for SIMD hardware with FMA
  ● The theoretical limit is 12 Gflops/core (16 Gflops for DGEMM)
  ● 11.65 Gflops on a simple kernel benchmark
    ● 97 % of the theoretical limit (73 % of the peak)
  (A scalar sketch of the kernel follows.)
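  A scalar C sketch of the gravity kernel idea; the real kernel is written with SIMD/FMA intrinsics as an extension of Phantom-GRAPE, and the branch-free step mask below is only an illustrative stand-in for the production cutoff function.

```c
#include <math.h>

void gravity_kernel(const double (*xj)[3], const double *mj, int nj,
                    const double xi[3], double eps2, double rcut2,
                    double acc[3])
{
    double ax = 0.0, ay = 0.0, az = 0.0;
    for (int j = 0; j < nj; j++) {
        double dx = xj[j][0] - xi[0];
        double dy = xj[j][1] - xi[1];
        double dz = xj[j][2] - xi[2];
        double r2 = dx * dx + dy * dy + dz * dz + eps2; /* softened, no i==j test */

        /* Cutoff without a branch: mask is 1 inside rcut, 0 outside. */
        double mask = (double)(r2 < rcut2);
        double rinv = 1.0 / sqrt(r2);
        double f    = mask * mj[j] * rinv * rinv * rinv; /* m / r^3 (G = 1)      */

        ax += f * dx;      /* maps to FMA on hardware that supports it */
        ay += f * dy;
        az += f * dz;
    }
    acc[0] += ax; acc[1] += ay; acc[2] += az;
}
```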

  16. Other optimizations
  ● Node assignment: the physical layout of the domains fits the torus network topology of K
  ● MPI + OpenMP hybrid
    -> reduces network congestion
  ● Overlap communication with calculation (see the sketch below)
    -> hidden communication
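  A minimal sketch of communication/computation overlap with non-blocking MPI; the buffer and routine names are illustrative, not the actual GreeM routines.

```c
#include <mpi.h>

void compute_local_interactions(void);               /* placeholder */
void compute_boundary_interactions(double *, int);   /* placeholder */

void overlapped_step(const double *send, int nsend, double *recv, int nrecv,
                     int neighbor, MPI_Comm comm)
{
    MPI_Request req[2];

    /* Post the exchange of boundary (ghost) data first... */
    MPI_Irecv(recv, nrecv, MPI_DOUBLE, neighbor, 0, comm, &req[0]);
    MPI_Isend(send, nsend, MPI_DOUBLE, neighbor, 0, comm, &req[1]);

    /* ...then compute everything that only needs local particles
     * while the network is busy (the communication is "hidden"). */
    compute_local_interactions();

    /* Wait for the boundary data, then do the work that needs it. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    compute_boundary_interactions(recv, nrecv);
}
```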

  17. All shuffle (e.g. restart)
  ● "Alltoallv" is easy, but if p > ~10000...
    ● Communication cost O(p log p)
    ● Is the communication buffer sufficient?
  ● MPI_Alltoallv(..., MPI_COMM_WORLD): O(p log p)
  ● Split MPI_COMM_WORLD:
    MPI_Alltoallv(..., COMM_X) + MPI_Alltoallv(..., COMM_Y) + MPI_Alltoallv(..., COMM_Z)
    O(3 p^(1/3) log p^(1/3)) = O(p^(1/3) log p)
  (A sketch of the three-phase shuffle follows.)
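  A sketch of how the per-axis communicators could be created with MPI_Comm_split, assuming an NX x NY x NZ rank grid; the forwarding order (x, then y, then z) follows the slide, the rest is illustrative.

```c
#include <mpi.h>

void make_axis_comms(int NX, int NY, int NZ,
                     MPI_Comm *comm_x, MPI_Comm *comm_y, MPI_Comm *comm_z)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Recover this rank's (ix, iy, iz) position in the NX x NY x NZ grid. */
    int ix = rank / (NY * NZ);
    int iy = (rank / NZ) % NY;
    int iz = rank % NZ;
    (void)NX;

    /* Ranks sharing (iy, iz) differ only in ix -> they form COMM_X, etc. */
    MPI_Comm_split(MPI_COMM_WORLD, iy * NZ + iz, ix, comm_x);
    MPI_Comm_split(MPI_COMM_WORLD, ix * NZ + iz, iy, comm_y);
    MPI_Comm_split(MPI_COMM_WORLD, ix * NY + iy, iz, comm_z);

    /* A full shuffle then becomes:
     *   MPI_Alltoallv(..., *comm_x);  forward data toward its target ix
     *   MPI_Alltoallv(..., *comm_y);  then toward its target iy
     *   MPI_Alltoallv(..., *comm_z);  finally toward its target iz
     * so each communicator has only O(p^(1/3)) members. */
}
```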

  18. Performance results on the K computer
  ● Scalability (2048^3 - 10240^3 particles)
    ● Excellent strong scaling
    ● The 10240^3 simulation scales well from 24576 to 82944 (full) nodes of the K computer
  (Figure: sec/step vs. number of nodes for 2048^3, 4096^3, 8192^3, and 10240^3 particles)
  ● Performance (12600^3 particles)
    ● The average performance on the full system (82944 = 48x54x32 nodes) is ~5.8 Pflops, 55 % of the peak speed
  Ishiyama, Nitadori, and Makino, 2012 (arXiv:1211.4406), SC12 Gordon Bell Prize winner

  19. Comparison with another code
  ● Our code (Ishiyama et al. 2012)
    ● 5.8 Pflops (55 % of peak) on the K computer (10.6 Pflops)
    ● 2.5x10^-11 sec / substep / particle
  ● HACC (Habib et al. 2012)
    ● 14 Pflops (70 % of peak) on Sequoia (20 Pflops)
    ● 6.0x10^-11 sec / substep / particle
  Our code can perform cosmological simulations ~5 times faster than other optimized codes.

  20. Ongoing simulations
  ● Semi-analytic galaxy formation model
    ● 4096^3, 800 Mpc box
    ● 8192^3, 1600 Mpc box
  ● Cosmology
    ● 4096^3, 3-6 Gpc, with and without non-Gaussianity
  ● Dark matter detection
    ● 4096^3 - 10240^3, ~1 kpc

  21. Simulation setup: a 16,777,216,000 particle simulation
  Visualization: Takaaki Takeda (4D2U, National Astronomical Observatory of Japan)

  22. Summary
  ● We developed the fastest TreePM code in the world:
    ● Dynamic domain update with a time-based load balancer
    ● Split MPI_COMM_WORLD
    ● Highly tuned gravity kernel for the short-range forces
  ● The average performance on the full system of the K computer is ~5.8 Pflops, which corresponds to ~55 % of the peak speed
  ● Our code can perform cosmological simulations about five times faster than other optimized codes
    -> larger, higher resolution simulations
