ChaNGa: The Charm N-Body GrAvity Solver

Filippo Gioachin¹, Pritish Jetley¹, Celso Mendes¹, Laxmikant Kale¹, Thomas Quinn²
¹ University of Illinois at Urbana-Champaign
² University of Washington
Outline
● Motivations
● Algorithm overview
● Scalability
● Load balancer
● Multistepping
Motivations
● Need for simulations of the evolution of the universe
● Current parallel codes:
  – PKDGRAV
  – Gadget
● Scalability problems:
  – load imbalance
  – expensive domain decomposition
  – scaling limited to about 128 processors
ChaNGa: main characteristics
● Simulator of cosmological interactions
  – Newtonian gravity
  – Periodic boundary conditions
  – Multiple timestepping
● Particle based (Lagrangian)
  – high resolution where needed
  – based on tree structures
● Implemented in Charm++
  – work divided among chares called TreePieces
  – processor-level optimization using a Charm++ group called CacheManager
Space decomposition
[Figure: particles in space partitioned among TreePiece 1, TreePiece 2, TreePiece 3, ...]
Basic algorithm...
● Newtonian gravity interaction
  – Each particle is influenced by all the others: O(n²) algorithm
● Barnes-Hut approximation: O(n log n)
  – The influence of distant particles is combined into a center of mass (see the sketch below)
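To make the idea concrete, here is a minimal, hedged sketch of a Barnes-Hut force walk. All names (Vec3, Node, gravity) are illustrative, not ChaNGa's actual API; it assumes G = 1 units and the simple size/distance opening test.

```cpp
// Barnes-Hut sketch: a distant node is replaced by its center of mass when
// it subtends a small enough angle, i.e. size / distance < theta.
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

struct Node {
    Vec3 com;                      // center of mass of the node's particles
    double mass;                   // total mass
    double size;                   // side length of the bounding cube
    std::vector<Node*> children;   // empty for leaves
    bool isLeaf() const { return children.empty(); }
};

static double dist(const Vec3& a, const Vec3& b) {
    double dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    return std::sqrt(dx*dx + dy*dy + dz*dz);
}

// Accumulate the gravitational acceleration on a particle at 'pos' by
// walking the tree; theta is the opening angle (~0.7 in practice).
void gravity(const Node* n, const Vec3& pos, double theta, Vec3& acc) {
    double d = dist(pos, n->com);
    if (d == 0.0) return;                     // skip self-interaction
    if (n->isLeaf() || n->size / d < theta) {
        // Far enough: treat the whole node as a point mass at its COM.
        double f = n->mass / (d * d * d);     // G = 1 units
        acc.x += f * (n->com.x - pos.x);
        acc.y += f * (n->com.y - pos.y);
        acc.z += f * (n->com.z - pos.z);
    } else {
        // Too close: open the node and recurse into its children.
        for (const Node* c : n->children) gravity(c, pos, theta, acc);
    }
}
```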
... in parallel
● Remote data
  – must be fetched from other processors
● Data reuse
  – the same data is needed by more than one particle
Overall algorithm
[Flowchart: on each processor, a TreePiece starts its computation with a high-priority local tree walk, plus low-priority prefetch, global, and remote work. A walk that misses local data asks the per-processor CacheManager whether the node is present: if yes, the node is returned immediately; if not, it is fetched from a TreePiece on another processor, buffered, and delivered later through a callback. Local, global, and remote work are interleaved until the computation ends.]
Systems

System       Location    Procs   CPU per node            Memory per node  Network
Tungsten     NCSA        2,560   2 Xeon 3.2 GHz          3 GB             Myrinet
Cray XT3     Pittsburgh  4,136   2 Opteron 2.6 GHz       2 GB             Torus
BlueGene/L   IBM Watson  40,000  2 PowerPC 440 700 MHz   512 MB           Torus
Scaling: comparison
[Plot: lambs 3M dataset on Tungsten]
Scaling: IBM BlueGene/L
Scaling: Cray XT3
Load balancing with OrbLB
[Timeline: lambs 5M on 1,024 BlueGene/L processors; processors vs. time, white is good]
Scaling with load balancing
[Plot: number of processors × execution time per iteration (s)]
Multistepping
● Particles with higher accelerations require smaller timesteps to be integrated accurately.
● Compute the particles with the highest accelerations every step, and particles with lower accelerations only every few steps.
● Steps therefore differ in load (see the rung-assignment sketch below).
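As an illustration of how particles can be binned by timestep, here is a hedged sketch of "rung" assignment. The dt ~ sqrt(eps/|a|) criterion and power-of-two rung spacing are common practice in multistepped N-body codes, but the names and details here are assumptions, not ChaNGa's actual code.

```cpp
// Rung r uses timestep dtBase / 2^r, so high-acceleration particles land on
// high rungs and are integrated every substep, low-acceleration ones rarely.
#include <cmath>

int chooseRung(double accel, double eps, double dtBase, int maxRung) {
    double dt = std::sqrt(eps / accel);   // desired timestep for this particle
    int r = 0;
    while (r < maxRung && dtBase / std::pow(2.0, r) > dt) ++r;
    return r;                             // smallest rung whose timestep fits
}

// A particle on rung r is active on substep i of a big step when i is a
// multiple of 2^(maxRung - r); rung maxRung is active on every substep.
bool isActive(int rung, int substep, int maxRung) {
    return substep % (1 << (maxRung - rung)) == 0;
}
```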
ChaNGa scalability: multistepping
[Plot: dwarf 5M on Tungsten]
Future work
● Adding new physics
  – Smoothed Particle Hydrodynamics
● More load balancing / scalability work
  – Reducing communication overhead
  – Load balancing without increasing communication volume
  – Multiphase load balancing for multistepping
  – Other phases of the computation
Questions?

Thank you
Decomposition types
● OCT
  – A contiguous cubic volume of space to each TreePiece
● SFC (Morton and Peano-Hilbert)
  – The Space Filling Curve imposes a total ordering of the particles
  – A segment of this curve to each TreePiece (see the Morton-key sketch below)
● ORB
  – Space divided by Orthogonal Recursive Bisection on the number of particles
  – A contiguous non-cubic volume of space to each TreePiece
  – Due to the shapes of the decomposition, requires more computation to produce correct results
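As a concrete illustration of SFC decomposition, here is a minimal sketch of a 3D Morton key built by bit interleaving: sort particles by key and hand contiguous key segments to TreePieces. The 21-bit quantization and magic-constant spread are a standard technique; none of this is ChaNGa's actual code.

```cpp
#include <cstdint>

// Spread the low 21 bits of v so there are two zero bits between each bit.
static uint64_t spreadBits(uint64_t v) {
    v &= 0x1fffff;                              // keep 21 bits (21*3 = 63)
    v = (v | v << 32) & 0x1f00000000ffffULL;
    v = (v | v << 16) & 0x1f0000ff0000ffULL;
    v = (v | v <<  8) & 0x100f00f00f00f00fULL;
    v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
    v = (v | v <<  2) & 0x1249249249249249ULL;
    return v;
}

// Morton key of a particle whose coordinates are quantized to 21-bit ints.
uint64_t mortonKey(uint32_t x, uint32_t y, uint32_t z) {
    return spreadBits(x) | (spreadBits(y) << 1) | (spreadBits(z) << 2);
}
```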
Serial performance
Execution time on Tungsten (in seconds), lambs datasets

Simulator        30,000   300,000   1,000,000   3,000,000
PKDGRAV          0.8      12.0      48.5        170.0
ChaNGa           0.8      13.2      53.6        180.6
Time difference  0.00%    9.09%     9.51%       5.87%
CacheManager importance
1 million lambs dataset on HPCx

Number of processors               4        8        16       32       64
Messages (thousands), no cache     48,723   59,115   59,116   68,937   78,086
Messages (thousands), with cache   72       115      169      265      397
Time (s), no cache                 730.7    453.9    289.1    67.4     42.1
Time (s), with cache               39.0     20.4     11.3     6.0      3.3
Speedup                            18.74    22.25    25.58    11.23    12.76
Prefetching
1) Explicit
  ● Before force computation, data is requested for preload
2) Implicit in the cache
  ● Computation is performed with tree walks
  ● After a node is visited, its children will likely be visited
  ● While fetching remote nodes, the cache therefore prefetches some of their children (see the sketch below)
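A minimal sketch of the implicit-prefetch idea, assuming a binary node-key scheme where a node's children have keys 2k and 2k+1. All names here (NodeKey, sendFetch, requestWithPrefetch) are hypothetical stand-ins, not the CacheManager's real interface.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

using NodeKey = uint64_t;

// Children of a node in a binary key scheme (assumption for this sketch).
static std::vector<NodeKey> childrenOf(NodeKey k) { return {2 * k, 2 * k + 1}; }

static void sendFetch(NodeKey k) {
    std::cout << "fetch node " << k << "\n";   // stand-in for a network request
}

// On a miss, fetch the node and speculatively fetch its subtree to 'depth'.
void requestWithPrefetch(NodeKey key, int depth) {
    sendFetch(key);
    if (depth > 0)
        for (NodeKey c : childrenOf(key))
            requestWithPrefetch(c, depth - 1);  // speculative fetches
}
```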
Cache implicit prefetching
[Plot: lambs dataset on 64 processors of Tungsten; execution time (in seconds) and memory consumption (in MB) versus cache prefetch depth]
Charm++ overview
[Diagram: user view of communicating chare objects vs. system view of processors P1, P2, P3]
● work is decomposed into objects called chares
● message-driven execution
● the mapping of objects to processors is transparent to the user
● automatic load balancing
● communication optimization
Tree decomposition
● Exclusive
● Shared
● Remote
[Figure: the global tree distributed among TreePiece 1, TreePiece 2, and TreePiece 3, with each node classified as exclusive, shared, or remote]
Scalability comparison (old result)
[Plot: dwarf 5M comparison on Tungsten; processors × time per iteration, so flat means perfect scaling and diagonal means no scaling]
ChaNGa scalability (old results)
[Plot: results on BlueGene/L; flat means perfect scaling, diagonal means no scaling]
Interaction list
[Figure: TreePiece A and a node X against which its interaction list is built]
Interaction lists: opening criteria
[Figure: the opening-criteria cut-off around node X; a candidate node can be accepted, opened, or left undecided]
Interaction list walk
[Diagram: a double simultaneous walk in two copies of the tree. Each visited node X carries a check list and an interaction list: nodes accepted from the check list move to X's interaction list (used for force computation), nodes that must be opened are expanded, and undecided nodes are passed down to the check lists of the children of X, so tests resolved at the parent are never repeated. A hedged sketch follows.]
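Here is a minimal, hedged sketch of such a dual walk. The data layout, the crude distance-based openingTest, and all names are assumptions for illustration, not the actual ChaNGa implementation.

```cpp
#include <cmath>
#include <vector>

struct TNode {
    double center[3];
    double size;                        // half-width of the bounding cube
    std::vector<TNode*> children;
    bool isLeaf() const { return children.empty(); }
};

enum class Test { Accept, Open, Undecided };

// Crude stand-in for the real opening criterion, based only on the distance
// between node centers and their sizes.
Test openingTest(const TNode* n, const TNode* x) {
    double d2 = 0;
    for (int i = 0; i < 3; ++i) {
        double d = n->center[i] - x->center[i];
        d2 += d * d;
    }
    double d = std::sqrt(d2), r = n->size + x->size;
    if (d > 2.0 * r) return Test::Accept;     // far: use the center of mass
    if (d < r)       return Test::Open;       // close: must expand the node
    return Test::Undecided;                   // let x's children decide
}

// Walk x's subtree while consuming the check list inherited from x's parent.
void dualWalk(TNode* x, std::vector<TNode*> checkList) {
    std::vector<TNode*> interactions, undecided;
    while (!checkList.empty()) {
        TNode* n = checkList.back();
        checkList.pop_back();
        switch (openingTest(n, x)) {
            case Test::Accept:    interactions.push_back(n); break;
            case Test::Open:      // re-test the children of n at this level
                for (TNode* c : n->children) checkList.push_back(c);
                break;
            case Test::Undecided: undecided.push_back(n); break;
        }
    }
    // ... apply forces from 'interactions' to every particle under x ...
    // (a real code would fully resolve 'undecided' once x is a leaf bucket)
    for (TNode* c : x->children) dualWalk(c, undecided);
}
```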
Interaction list: results

Number of checks for the opening criteria (in millions):

                  lambs 1M   dwarf 5M
Original code     120        1,108
Interaction list  66         440

[Plot: dwarf 5M on HPCx; relative time of the original code vs. interaction lists on 32 to 512 processors; about 10% average performance improvement]
Load balancer
[Plot: dwarf 5M dataset on BlueGene/L; improvement between 15% and 35%; flat lines are good, rising lines are bad]
ChaNGa scalability
[Plot: flat means perfect scaling, diagonal means no scaling]