
Large Scale Dark Matter Simulation - PowerPoint PPT Presentation



  1. Large Scale Dark Matter Simulation. Tomoaki Ishiyama, University of Tsukuba (HPCI Strategic Field Program 5)

  2. Standard cosmology: the cold dark matter (CDM) model
  ● Dark matter dominates gravitational structure formation in the Universe: there is five times as much of it as baryonic matter
  ● Structure grows from primordial small density fluctuations
  ● Smaller dark matter structures gravitationally collapse first, then merge into larger structures (dark matter halos)
  ● Baryonic matter condenses in the potential wells of dark matter halos and forms galaxies
  (Figure: large scale structure, dark matter halo, galaxy)

  3. Large scale dark matter simulation
  ● High resolution observations
  ● Wide field
  -> large, high resolution simulations: 4096^3 - 8192^3 particles in 800-1600 Mpc boxes
  ● Preparation for other large scale simulations
    ● Galaxy formation
    ● Star formation

  4. Parallel scaling
  ● Parallel calculation is not trivial
    ● communication overhead
  ● Gravity is a long-range force -> long communication
  ● Cosmological N-body simulation codes can run efficiently on systems with a few hundred nodes. But...
  (Figure: cluster schematic with nodes (CPU + memory) and a fast network; scaling of Gadget-2, Springel 2005)

  5. PC cluster (~100 cores), CfCA Cray-XT4 (~700 nodes, ~3000 cores), CfCA Cray-XC30 (~1060 nodes, ~25440 cores), K computer (~80000 nodes, ~0.66M cores)
  How can we get good efficiency on ~1M core systems?
  Here we use the K computer, the world's third fastest / most power consuming system.

  6. Well known N-body algorithms
  ● Tree (Barnes and Hut, 1986)
    ● Solves the forces from distant particles by multipole expansions
    ● O(N log N)
  ● Particle-Mesh (PM: e.g. Hockney and Eastwood, 1981)
    ● Calculates the mass density on a coarse-grained mesh
    ● Solves the Poisson equation by FFT (ρ ➙ Φ; see the worked form below)
    ● Advantage: very fast, and the periodic boundary is naturally satisfied
    ● Disadvantage: resolution is restricted by the mesh size
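  For reference, the PM Poisson step written out per Fourier mode under the standard Green's function; mesh assignment and deconvolution factors are omitted here.

```latex
% Poisson equation on the mesh, solved per Fourier mode after the FFT.
\nabla^{2}\Phi = 4\pi G\rho
\quad\xrightarrow{\ \mathrm{FFT}\ }\quad
-\,k^{2}\,\hat{\Phi}(\mathbf{k}) = 4\pi G\,\hat{\rho}(\mathbf{k})
\quad\Longrightarrow\quad
\hat{\Phi}(\mathbf{k}) = -\,\frac{4\pi G\,\hat{\rho}(\mathbf{k})}{k^{2}},
\qquad \hat{\Phi}(\mathbf{0}) = 0 .
```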

  7. TreePM N-body solver
  ● Tree for short-range forces
  ● PM (particle mesh) for long-range forces
  ● O(N^2) reduced to O(N log N)
  ● Suitable for periodic boundaries
  ● Has become popular since the 21st century:
    ● GOTPM (Dubinski+ 2004)
    ● GADGET-2 (Springel 2005)
    ● GreeM (Ishiyama+ 2009)
    ● HACC (Habib+ 2012)
  (A sketch of the short-range side of the force split follows this slide.)
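  A minimal scalar sketch of the short-range (Tree) side of the TreePM split, assuming the classic Gaussian/erfc splitting used e.g. by GADGET-2; GreeM itself uses a different cutoff function, and the names rs and rcut are illustrative.

```c
/* Short-range pairwise force under the erfc force splitting.
 * rs   : splitting scale tied to the PM mesh
 * rcut : radius beyond which the short-range force is treated as zero */
#include <math.h>

static void short_range_force(const double xi[3], const double xj[3],
                              double mj, double rs, double rcut,
                              double acc[3])
{
    double dx[3], r2 = 0.0;
    for (int k = 0; k < 3; k++) {
        dx[k] = xj[k] - xi[k];
        r2 += dx[k] * dx[k];
    }
    if (r2 == 0.0 || r2 > rcut * rcut) return;   /* outside cutoff */

    double r = sqrt(r2);
    double u = r / (2.0 * rs);
    /* 1/r^2 force times a smooth truncation factor */
    double g = erfc(u) + u * (2.0 / sqrt(M_PI)) * exp(-u * u);
    double f = mj * g / (r2 * r);                /* G = 1 units */

    for (int k = 0; k < 3; k++) acc[k] += f * dx[k];
}
```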

  8. Parallel calculation procedure
  1. 3D domain decomposition, redistribute particles (short and long communication)
  2. Calculate long-range forces (PM) (long communication)
  3. Calculate short-range forces (Tree) (short communication)
  4. Time integration (second order leapfrog)
  5. Return to 1
  (A sketch of this main loop follows.)
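  A minimal sketch of this loop with a kick-drift-kick (KDK) second-order leapfrog, ignoring cosmological expansion factors; decompose_domain() and compute_forces() are placeholders for steps 1-3, not the actual GreeM routines.

```c
/* One force evaluation per step; accelerations are stored in p[i].a. */
typedef struct { double x[3], v[3], a[3], m; } Particle;

void decompose_domain(Particle **p, int *n_local);   /* placeholder: step 1    */
void compute_forces(Particle *p, int n_local);       /* placeholder: steps 2+3 */

static void kick(Particle *p, int n, double dt)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < 3; k++) p[i].v[k] += dt * p[i].a[k];
}

static void drift(Particle *p, int n, double dt)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < 3; k++) p[i].x[k] += dt * p[i].v[k];
}

void run(Particle *p, int n_local, double dt, int nsteps)
{
    decompose_domain(&p, &n_local);
    compute_forces(p, n_local);            /* accelerations for the first kick */

    for (int step = 0; step < nsteps; step++) {
        kick(p, n_local, 0.5 * dt);        /* 4. half kick                     */
        drift(p, n_local, dt);             /* 4. drift                         */

        decompose_domain(&p, &n_local);    /* 1. redistribute particles        */
        compute_forces(p, n_local);        /* 2+3. PM (long) + Tree (short)    */

        kick(p, n_local, 0.5 * dt);        /* 4. closing half kick             */
    }                                      /* 5. return to 1                   */
}
```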

  9. Technical issues
  ● Dense structures form everywhere via the gravitational instability
    -> Dynamic domain update with a good load balancer
  ● Gravity is a long-range force
    -> Hierarchical communication
  ● O(N log N) is still expensive
    -> SIMD (GRAPEs, GPU)

  10. Optimization 1: Dynamic domain update
  ● Domain geometry
    ● ∆ Space filling curve (GADGET-2)
    ● ○ Multisection (our code): enables each node to know easily where to perform short communications
  ● Load balance criterion
    ● ⅹ equal #particles
    ● ∆ equal #interactions
    ● ○ equal calculation time (our code)
  (A sketch of owner lookup under multisection follows.)
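  A minimal sketch of why multisection makes the lookup easy: the geometry is fully described by three small arrays of split coordinates, so any node finds the owner of a position by three binary searches. The array names and the row-major rank layout below are assumptions, not the actual GreeM data structures.

```c
/* Assumes pos lies inside the box, i.e. within [s[0], s[n]) on each axis. */
static int find_section(const double *s, int n, double v)
{
    int lo = 0, hi = n;                 /* s[0..n] are n+1 section boundaries */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (v < s[mid + 1]) hi = mid; else lo = mid + 1;
    }
    return lo;                          /* index of the containing section */
}

/* nx*ny*nz ranks; *split arrays hold n+1 boundaries per axis. */
int owner_rank(const double pos[3],
               const double *xsplit, int nx,
               const double *ysplit, int ny,
               const double *zsplit, int nz)
{
    int ix = find_section(xsplit, nx, pos[0]);
    int iy = find_section(ysplit, ny, pos[1]);
    int iz = find_section(zsplit, nz, pos[2]);
    return (ix * ny + iy) * nz + iz;    /* assumed row-major rank layout */
}
```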

  11. Load balancer
  ● Sampling method to determine the geometry
  ● Each node samples particles and shares them with all nodes (sampling rate R ~ 10^-3 - 10^-5)
  ● The sampling frequency depends on the local calculation cost
    -> realizes a near-ideal load balance
  ● The new geometry is adjusted so that all domains have the same number of sampled particles
  ● Linear weighted moving average over the last 5 steps
  (Flow: calculation cost -> #sampled particles -> new geometry -> new calculation cost; a sketch follows.)
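  A minimal sketch of the sampling step, assuming samples are drawn at a rate proportional to the measured local cost and shared with everyone via MPI_Allgatherv; the names and the base rate are illustrative, not the GreeM implementation.

```c
#include <mpi.h>
#include <stdlib.h>

typedef struct { double x[3]; } Sample;

int gather_samples(const double (*pos)[3], int n_local,
                   double local_cost, double mean_cost,
                   double base_rate,              /* e.g. ~1e-4 */
                   Sample **all, MPI_Comm comm)
{
    int nprocs;  MPI_Comm_size(comm, &nprocs);

    /* Sampling rate scaled by relative cost: expensive ranks send more
     * samples, so equal-sample domains end up with equal cost. */
    double rate = base_rate * local_cost / mean_cost;
    int n_s = 0;
    Sample *mine = malloc((size_t)n_local * sizeof(Sample));
    for (int i = 0; i < n_local; i++)
        if (rand() / (double)RAND_MAX < rate) {
            for (int k = 0; k < 3; k++) mine[n_s].x[k] = pos[i][k];
            n_s++;
        }

    /* Share sample counts, then the samples themselves, with all ranks. */
    int *counts = malloc(nprocs * sizeof(int));
    int *displs = malloc(nprocs * sizeof(int));
    MPI_Allgather(&n_s, 1, MPI_INT, counts, 1, MPI_INT, comm);
    int total = 0;
    for (int r = 0; r < nprocs; r++) { displs[r] = total; total += counts[r]; }

    *all = malloc((size_t)total * sizeof(Sample));
    for (int r = 0; r < nprocs; r++) { counts[r] *= 3; displs[r] *= 3; }
    MPI_Allgatherv(mine, 3 * n_s, MPI_DOUBLE,
                   *all, counts, displs, MPI_DOUBLE, comm);

    free(mine); free(counts); free(displs);
    return total;   /* new boundaries = coordinate quantiles of these samples */
}
```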

  12. Optimization 2: Parallel PM
  1. Density assignment
  2. Local density -> slab density (long communication)
  3. FFT (density -> potential); long communication inside the FFT (FFTW, Fujitsu FFT)
  4. Slab potential -> local potential (long communication)
  5. Force interpolation
  (A slab-FFT sketch follows.)
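  A minimal sketch of the slab-decomposed FFT step using FFTW's MPI interface (the production code can also use Fujitsu's FFT); only the forward transform is shown, and the mesh size NG is an assumed parameter.

```c
#include <fftw3-mpi.h>

void pm_fft_forward(ptrdiff_t NG, MPI_Comm comm)
{
    fftw_mpi_init();

    /* FFTW assigns each rank a slab of local_n0 planes along the first axis. */
    ptrdiff_t local_n0, local_0_start;
    ptrdiff_t alloc = fftw_mpi_local_size_3d(NG, NG, NG / 2 + 1, comm,
                                             &local_n0, &local_0_start);

    double       *rho     = fftw_alloc_real(2 * alloc);   /* padded real slab */
    fftw_complex *rho_hat = fftw_alloc_complex(alloc);

    fftw_plan plan = fftw_mpi_plan_dft_r2c_3d(NG, NG, NG, rho, rho_hat,
                                              comm, FFTW_ESTIMATE);

    /* ... fill rho with the density received from the 3D-decomposed ranks
     *     (the "local density -> slab density" long communication) ... */

    fftw_execute(plan);   /* internal all-to-all transpose = long communication */

    /* ... multiply rho_hat by the Green's function -4*pi*G/k^2,
     *     inverse-transform, and scatter the slab potential back ... */

    fftw_destroy_plan(plan);
    fftw_free(rho); fftw_free(rho_hat);
}
```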

  13. Optimization 2: Numerical challenge
  ● If we simulate ~10000^3 particles with a ~5000^3 PM mesh on the full system of the K computer...
  ● The FFT is performed by only ~5000 nodes (FFTW), since only the slab mesh layout is supported
  ● Particles are distributed over a 3D domain (48x54x32 = 82944 nodes)
  ● Each FFT process has to receive density meshes from ~5000 nodes
    ● MPI_Isend / MPI_Irecv would not work
    ● MPI_Alltoallv is easy to implement, but...

  14. Optimization 2: Novel communication algorithm
  ● Hierarchical communication
  ● MPI_Alltoallv(..., MPI_COMM_WORLD)
    -> MPI_Alltoallv(..., COMM_SMALLA2A) + MPI_Reduce(..., COMM_REDUCE)
  (A sketch of the two-level scheme follows.)
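  A sketch of how the two communicators named above could be built and used. The concrete grouping of ranks in GreeM may differ, so the split keys and the ordering (small all-to-all within a group, then a sum across groups onto the FFT ranks) are assumptions for illustration only.

```c
#include <mpi.h>

void hierarchical_density_exchange(const double *sendbuf, const int *scounts,
                                   const int *sdispls,
                                   double *groupbuf, const int *rcounts,
                                   const int *rdispls, int nrecv,
                                   double *slabbuf,
                                   int group_id, int member_id)
{
    MPI_Comm comm_smalla2a, comm_reduce;
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Ranks with the same group_id form one small all-to-all group;
     * ranks with the same member_id across groups form one reduce group. */
    MPI_Comm_split(MPI_COMM_WORLD, group_id,  world_rank, &comm_smalla2a);
    MPI_Comm_split(MPI_COMM_WORLD, member_id, world_rank, &comm_reduce);

    /* Step 1: small all-to-all inside the group instead of a global one. */
    MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_DOUBLE,
                  groupbuf, rcounts, rdispls, MPI_DOUBLE, comm_smalla2a);

    /* Step 2: sum the partial density meshes across groups onto rank 0
     * of each reduce communicator (an FFT process). */
    MPI_Reduce(groupbuf, slabbuf, nrecv, MPI_DOUBLE, MPI_SUM, 0, comm_reduce);

    MPI_Comm_free(&comm_smalla2a);
    MPI_Comm_free(&comm_reduce);
}
```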

  15. Optimization 3: Gravity kernel
  This implementation was done by Keigo Nitadori as an extension of the Phantom-GRAPE library (Nitadori+ 2006, Tanikawa+ 2012, 2013)
  1. One if-branch is eliminated
  2. Optimized for SIMD hardware with FMA
  ● The theoretical limit is 12 Gflops/core (16 Gflops for DGEMM)
  ● 11.65 Gflops on a simple kernel benchmark
    ● 97 % of the theoretical limit (73 % of the peak)
  (A scalar sketch of the kernel follows.)
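  A scalar C sketch of the gravity kernel idea; the real kernel is written with SIMD/FMA intrinsics as an extension of Phantom-GRAPE, and the branch-free step mask below is only an illustrative stand-in for the production cutoff function.

```c
#include <math.h>

void gravity_kernel(const double (*xj)[3], const double *mj, int nj,
                    const double xi[3], double eps2, double rcut2,
                    double acc[3])
{
    double ax = 0.0, ay = 0.0, az = 0.0;
    for (int j = 0; j < nj; j++) {
        double dx = xj[j][0] - xi[0];
        double dy = xj[j][1] - xi[1];
        double dz = xj[j][2] - xi[2];
        double r2 = dx * dx + dy * dy + dz * dz + eps2; /* softened, no i==j test */

        /* Cutoff without a branch: mask is 1 inside rcut, 0 outside. */
        double mask = (double)(r2 < rcut2);
        double rinv = 1.0 / sqrt(r2);
        double f    = mask * mj[j] * rinv * rinv * rinv; /* m / r^3 (G = 1)      */

        ax += f * dx;      /* maps to FMA on hardware that supports it */
        ay += f * dy;
        az += f * dz;
    }
    acc[0] += ax; acc[1] += ay; acc[2] += az;
}
```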

  16. Other optimizations
  ● Node assignment: the physical layout of the domains fits the torus network topology of K
  ● MPI + OpenMP hybrid
    -> reduces network congestion
  ● Overlap communication with calculation (see the sketch below)
    -> hidden communication
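  A minimal sketch of communication/computation overlap with non-blocking MPI; the buffer and routine names are illustrative, not the actual GreeM routines.

```c
#include <mpi.h>

void compute_local_interactions(void);               /* placeholder */
void compute_boundary_interactions(double *, int);   /* placeholder */

void overlapped_step(const double *send, int nsend, double *recv, int nrecv,
                     int neighbor, MPI_Comm comm)
{
    MPI_Request req[2];

    /* Post the exchange of boundary (ghost) data first... */
    MPI_Irecv(recv, nrecv, MPI_DOUBLE, neighbor, 0, comm, &req[0]);
    MPI_Isend(send, nsend, MPI_DOUBLE, neighbor, 0, comm, &req[1]);

    /* ...then compute everything that only needs local particles
     * while the network is busy (the communication is "hidden"). */
    compute_local_interactions();

    /* Wait for the boundary data, then do the work that needs it. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    compute_boundary_interactions(recv, nrecv);
}
```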

  17. All shuffle (e.g. restart)
  ● "Alltoallv" is easy, but if p > ~10000...
    ● Communication cost O(p log p)
    ● Is the communication buffer sufficient?
  ● MPI_Alltoallv(..., MPI_COMM_WORLD): O(p log p)
  ● Split MPI_COMM_WORLD:
    MPI_Alltoallv(..., COMM_X) + MPI_Alltoallv(..., COMM_Y) + MPI_Alltoallv(..., COMM_Z)
    O(3 p^(1/3) log p^(1/3)) = O(p^(1/3) log p)
  (A sketch of the three-phase shuffle follows.)
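  A sketch of how the per-axis communicators could be created with MPI_Comm_split, assuming an NX x NY x NZ rank grid; the forwarding order (x, then y, then z) follows the slide, the rest is illustrative.

```c
#include <mpi.h>

void make_axis_comms(int NX, int NY, int NZ,
                     MPI_Comm *comm_x, MPI_Comm *comm_y, MPI_Comm *comm_z)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Recover this rank's (ix, iy, iz) position in the NX x NY x NZ grid. */
    int ix = rank / (NY * NZ);
    int iy = (rank / NZ) % NY;
    int iz = rank % NZ;
    (void)NX;

    /* Ranks sharing (iy, iz) differ only in ix -> they form COMM_X, etc. */
    MPI_Comm_split(MPI_COMM_WORLD, iy * NZ + iz, ix, comm_x);
    MPI_Comm_split(MPI_COMM_WORLD, ix * NZ + iz, iy, comm_y);
    MPI_Comm_split(MPI_COMM_WORLD, ix * NY + iy, iz, comm_z);

    /* A full shuffle then becomes:
     *   MPI_Alltoallv(..., *comm_x);  forward data toward its target ix
     *   MPI_Alltoallv(..., *comm_y);  then toward its target iy
     *   MPI_Alltoallv(..., *comm_z);  finally toward its target iz
     * so each communicator has only O(p^(1/3)) members. */
}
```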

  18. Performance results on the K computer
  ● Scalability (2048^3 - 10240^3 particles)
    ● Excellent strong scaling
    ● The 10240^3 simulation scales well from 24576 to 82944 (full) nodes of the K computer
  (Figure: sec/step vs. number of nodes for 2048^3, 4096^3, 8192^3, and 10240^3 particles)
  ● Performance (12600^3 particles)
    ● The average performance on the full system (82944 = 48x54x32 nodes) is ~5.8 Pflops, 55 % of the peak speed
  Ishiyama, Nitadori, and Makino, 2012 (arXiv:1211.4406), SC12 Gordon Bell Prize winner

  19. Comparison with another code
  ● Our code (Ishiyama et al. 2012)
    ● 5.8 Pflops (55 % of peak) on the K computer (10.6 Pflops)
    ● 2.5x10^-11 sec / substep / particle
  ● HACC (Habib et al. 2012)
    ● 14 Pflops (70 % of peak) on Sequoia (20 Pflops)
    ● 6.0x10^-11 sec / substep / particle
  Our code can perform cosmological simulations ~5 times faster than other optimized codes.

  20. Ongoing simulations
  ● Semi-analytic galaxy formation model
    ● 4096^3, 800 Mpc box
    ● 8192^3, 1600 Mpc box
  ● Cosmology
    ● 4096^3, 3-6 Gpc, with and without non-Gaussianity
  ● Dark matter detection
    ● 4096^3 - 10240^3, ~1 kpc

  21. Simulation setup: a 16,777,216,000 particle simulation
  Visualization: Takaaki Takeda (4D2U, National Astronomical Observatory of Japan)

  22. Summary
  ● We developed the fastest TreePM code in the world:
    ● Dynamic domain update with a time-based load balancer
    ● Split MPI_COMM_WORLD
    ● Highly tuned gravity kernel for the short-range forces
  ● The average performance on the full system of the K computer is ~5.8 Pflops, which corresponds to ~55 % of the peak speed
  ● Our code can perform cosmological simulations about five times faster than other optimized codes
    -> larger, higher resolution simulations
