Accelerating Atomistic Simulation on a Many-core Computing Platform
Liu Peng
Collaboratory for Advanced Computing & Simulations, Computer Science Department, University of Southern California
UnConventional High Performance Computing, Euro-Par 2010, Ischia, Naples, Italy
Atomistic Simulation
• Molecular Dynamics (MD): integrate Newton's equations of motion for all atoms, m_i d²r_i/dt² = −∂E_MD({r_i})/∂r_i
• Linked-list cell method for MD (figure: atom i and its neighbors j, k at distances r_ij, r_ik)
• Challenges: irregular memory access, frequent communication
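The linked-list cell construction mentioned above can be sketched as follows. The array names (`head`, `lscl`) and the uniform cubic cell geometry follow the textbook formulation of the method, not the slide itself.

```c
#include <assert.h>

#define EMPTY (-1)

/* Linked-list cell method: head[c] holds the index of the first atom in
   cell c, and lscl[i] chains atom i to the next atom of the same cell
   (EMPTY terminates the list). Traversing a cell's list touches memory
   irregularly -- the access pattern the later optimizations target. */
void build_cell_lists(int n, const double rx[], const double ry[],
                      const double rz[], double region, int lc,
                      int head[], int lscl[]) {
    double rc = region / lc;  /* cell edge length */
    for (int c = 0; c < lc * lc * lc; c++) head[c] = EMPTY;
    for (int i = 0; i < n; i++) {
        int cx = (int)(rx[i] / rc), cy = (int)(ry[i] / rc), cz = (int)(rz[i] / rc);
        int c = (cx * lc + cy) * lc + cz;  /* vector -> scalar cell index */
        lscl[i] = head[c];                 /* push atom i onto cell c's list */
        head[c] = i;
    }
}
```

Because each cell's list is built by pushing onto the head, atoms of one cell end up scattered across memory in reverse insertion order, which is exactly why the data-layout optimization below matters.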
GodsonT Many-core Computing Platform
64-core GodsonT many-core architecture
• 64 homogeneous, dual-issue cores at 1 GHz; 128 Gflops in total
• Lightweight hardware threads
• Explicit memory hierarchy
• 16 shared L2 cache banks, 256 KB each
• High-bandwidth on-chip network: 2 TB/s
Optimization Strategy I: Adaptive Divide-and-Conquer (ADC)
• Purpose: estimate the upper bound of the decomposition cell size at which all data fit into each core's local storage (scratchpad memory, SPM).
• Solution: recursively perform cellular decomposition until the estimated size P_c of all data in a cell of cell size R_c fits within the per-core SPM capacity L_spm, i.e. P_c ≤ L_spm (the criterion adapts to the size of each core's SPM).
• ADC + software-controlled memory (deciding when and where data reside in the SPM) to enhance data usage.
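A minimal sketch of the ADC criterion, assuming a density-based estimate of the per-cell data footprint; the function name and the particular parameters (number density, bytes per atom) are illustrative, not the slide's exact formula.

```c
#include <assert.h>

/* Adaptive divide-and-conquer (sketch): starting from the full domain edge
   length, recursively bisect the cellular decomposition until the estimated
   data size of one cell -- atoms per cell (estimated as density * Rc^3)
   times bytes per atom -- fits in the per-core SPM capacity. */
double adc_cell_size(double domain_edge, double density,
                     int bytes_per_atom, long spm_bytes) {
    double rc = domain_edge;
    while (density * rc * rc * rc * bytes_per_atom > (double)spm_bytes)
        rc *= 0.5;  /* one more level of recursive decomposition */
    return rc;
}
```

For example, with a 16-unit domain, unit density, 64 bytes per atom, and a 4 KB SPM, the recursion settles on a cell edge of 4 units.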
Optimization Strategy II: Data Layout Optimization
• Purpose: ensure contiguous access to the data of each cell.
• Solution: data grouping/reordering + local-ID-centered addressing.
• Group neighbor data in the L2 cache / off-chip memory.
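The grouping/reordering step can be sketched as a cell-ordered compaction of the atom arrays; `perm` records the local-ID-to-global-ID map used for local-ID-centered addressing (the names and the single-coordinate array are illustrative).

```c
#include <assert.h>

#define EMPTY (-1)

/* Data-layout optimization (sketch): walk the linked cell lists and copy
   atom data so that atoms of the same cell become contiguous in memory.
   perm[local] = global records the new local-ID -> old global-ID mapping,
   so later stages address atoms by compact local IDs instead of scattered
   global ones. */
void reorder_by_cell(int ncell, const int head[], const int lscl[],
                     const double x[], double x_out[], int perm[]) {
    int pos = 0;
    for (int c = 0; c < ncell; c++)
        for (int i = head[c]; i != EMPTY; i = lscl[i]) {
            perm[pos] = i;
            x_out[pos] = x[i];  /* contiguous, cell-grouped copy */
            pos++;
        }
}
```

After this pass, streaming a cell from L2 or off-chip memory is a single sequential read rather than a pointer chase through the linked lists.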
Optimization Strategy III: On-chip Locality Optimization
• Purpose: maximize data reuse for each cell.
• Solution: parallel pre-processing to achieve locality-awareness, then use that locality-awareness to maximize data reuse: if cell k resides on core i, use the pre-computed cell table PC to get all interacting cells and exhaust all intra-core computation first.
• Architecture mechanism support for high-bandwidth core-core communication.
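The locality-awareness pre-processing can be sketched as a scan over each cell's 27-neighborhood that records which interacting cells are resident on the same core; the cell-to-core map and the periodic wrapping are assumptions of this sketch, not details given on the slide.

```c
#include <assert.h>

static int cell_id(int x, int y, int z, int lc) { return (x * lc + y) * lc + z; }

/* Locality pre-processing (sketch): for one cell, count how many of its 26
   neighbor cells (periodic boundaries) live on the same core. Intra-core
   interactions can be exhausted first; only the remainder requires
   core-core communication over the on-chip network. */
int same_core_neighbors(int cx, int cy, int cz, int lc, const int cell_core[]) {
    int me = cell_core[cell_id(cx, cy, cz, lc)], count = 0;
    for (int dx = -1; dx <= 1; dx++)
        for (int dy = -1; dy <= 1; dy++)
            for (int dz = -1; dz <= 1; dz++) {
                if (!dx && !dy && !dz) continue;  /* skip the cell itself */
                int x = (cx + dx + lc) % lc;
                int y = (cy + dy + lc) % lc;
                int z = (cz + dz + lc) % lc;
                if (cell_core[cell_id(x, y, z, lc)] == me) count++;
            }
    return count;
}
```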
Optimization Strategy IV: Pipelining Algorithm
• Purpose: hide the latency of off-chip memory accesses.
• Solution: pipelining implemented via double-buffered, asynchronous DTA operations.

1. tag1 = tag2 = 0
2. for each cell c_core_i[k] listed in PC[c_j]
3.   if (tag1 ≠ tag2)
4.     DTA_ASYNC(spm_buf[1 − tag2], l2_dta_unit[c_core_i[k]])
5.     tag2 = 1 − tag2
6.   endif
7.   calculate atomic interactions between c_core_i[k] and c_j
8.   spm_buf[tag1] ← cell c_core_i[k]'s neighbor atomic data
9.   tag1 = 1 − tag1
10. endfor
11. if (tag1 ≠ tag2)
12.   DTA_ASYNC(spm_buf[1 − tag2], l2_dta_unit[c_core_i[k]])
13.   tag2 = 1 − tag2
14. endif

• If the interacting cell c_j is not in the same core, issue the memory transfer.
• If the interacting cell c_j is already in the same core, do the computation.
• The pipeline maximizes data reuse while overlapping transfer with computation.
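A host-side sketch of the double-buffered pipeline, assuming `memcpy` as a stand-in for the asynchronous `DTA_ASYNC` transfer engine and a caller-supplied kernel for the interaction computation; the names beyond `DTA_ASYNC` are illustrative. On GodsonT, the transfer of cell k+1 would genuinely overlap with the computation on cell k.

```c
#include <assert.h>
#include <string.h>

#define BUF_ATOMS 64  /* assumed SPM buffer capacity (doubles per cell) */

typedef void (*cell_kernel)(const double *atoms, int n);

/* Double-buffered pipelining (sketch): while the kernel computes on
   spm_buf[tag], the next cell's data is staged into spm_buf[1 - tag].
   memcpy stands in for the asynchronous DTA transfer of the real hardware,
   where the copy and the computation proceed concurrently. */
void pipelined_cells(const double *l2_cells[], const int cell_n[],
                     int ncells, cell_kernel compute_cell) {
    double spm_buf[2][BUF_ATOMS];
    int tag = 0;
    if (ncells == 0) return;
    /* prologue: fetch the first cell into the active buffer */
    memcpy(spm_buf[tag], l2_cells[0], cell_n[0] * sizeof(double));
    for (int k = 0; k < ncells; k++) {
        if (k + 1 < ncells)  /* issue the next transfer before computing */
            memcpy(spm_buf[1 - tag], l2_cells[k + 1],
                   cell_n[k + 1] * sizeof(double));
        compute_cell(spm_buf[tag], cell_n[k]);  /* overlaps the DTA on hardware */
        tag = 1 - tag;  /* swap the two SPM buffers */
    }
}
```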
Performance Tests
FPGA emulator for the 64-core GodsonT; on-chip strong scalability.
• optimization-1: only ADC; optimization-4: all four optimizations.
• Excellent strong-scaling multithreading parallel efficiency of 0.99 on 64 cores with 24,000 atoms (vs. 0.65 on an 8-core multicore).
Performance Analysis
• Running time: reduced by a factor of two.
• L2 cache performance: all L2 cache events are greatly reduced.
Performance Analysis
• Remote memory access performance (optimization-2 vs. optimization-3): the number of remote memory accesses is reduced to 7%.
Performance Model of a Many-core Parallel System
• Decent strong-scaling parallel efficiency of over 0.9 up to a billion processing elements, across various core-core communication latencies.
Conclusion
1. Locality optimization utilizing architecture mechanisms benefits strong scalability the most.
2. Many-core architectures have the potential for future exascale parallel systems.

Thanks! Research supported by ARO-MURI, DOE-SciDAC/BES, DTRA, NSF-ITR/PetaApps/CSR