Accelerating Parallel Analysis of Scientific Simulation Data via Zazen
Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw
D. E. Shaw Research
Motivation
• Goal: to model biological processes that occur on the millisecond time scale
• Approach: a specialized, massively parallel supercomputer called Anton (2009 ACM Gordon Bell Award for Special Achievement)
Millisecond-scale MD Trajectories
A biomolecular system: 25 K atoms
Position and velocity: 24 bytes/atom
Frame size: 25 K atoms × 24 bytes/atom ≈ 0.6 MB/frame
Simulation length: 1 × 10^-3 s
Output interval: 10 × 10^-12 s
Number of frames: (1 × 10^-3 s) ÷ (10 × 10^-12 s) = 100 M frames
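These figures multiply out as follows; a quick back-of-the-envelope check in Python, using only the numbers quoted on this slide, which also yields the 60 TB total trajectory size cited later:

```python
# Back-of-the-envelope check of the trajectory figures quoted above.
atoms = 25_000                    # atoms in the biomolecular system
bytes_per_atom = 24               # position + velocity per atom
frame_size = atoms * bytes_per_atom
print(frame_size / 1e6)           # ~0.6 MB per frame

sim_length = 1e-3                 # seconds of simulated physical time
output_interval = 10e-12          # seconds between stored frames
n_frames = sim_length / output_interval
print(n_frames / 1e6)             # ~100 M frames

trajectory_size = n_frames * frame_size
print(trajectory_size / 1e12)     # ~60 TB for the full trajectory
```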
Part I: How We Analyze Simulation Data in Parallel
An MD Trajectory Analysis Example: Ion Permeation
A Hypothetical Trajectory
20,000 atoms in total; two ions of interest.
[Figure: positions of Ion A and Ion B over the course of the trajectory]
Ion State Transition
[State diagram with states: Above channel, Into channel from above, Inside channel, Into channel from below, Below channel]
Typical Sequential Analysis
• Maintain a main-memory-resident data structure to record states and positions
• Process frames in ascending simulated-physical-time order
Strong inter-frame data dependence: data analysis is tightly coupled with data acquisition.
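A minimal sketch of this sequential pattern, for illustration only: read_frame() and update_ion_state() are hypothetical placeholders rather than functions from any real library, and the state names mirror the state diagram above.

```python
# Minimal sketch of the sequential analysis pattern described above.
# read_frame() and update_ion_state() are hypothetical placeholders.

def sequential_ion_analysis(frame_files, ion_ids):
    # Main-memory-resident record of each ion's current state.
    states = {ion: "above_channel" for ion in ion_ids}
    permeations = {ion: 0 for ion in ion_ids}

    # Frames must be visited in ascending simulated-physical-time order
    # (assumed here to match sorted file names), because each state
    # update depends on the state computed from the previous frame.
    for path in sorted(frame_files):
        frame = read_frame(path)                       # data acquisition
        for ion in ion_ids:                            # data analysis
            new_state = update_ion_state(states[ion], frame.positions[ion])
            if states[ion] == "inside_channel" and new_state == "below_channel":
                permeations[ion] += 1                  # one completed crossing
            states[ion] = new_state
    return permeations
```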
Problems with Sequential Analysis
• Millisecond-scale trajectory size: 60 TB
• Local disk read bandwidth: 100 MB/s
• Time to fetch data to memory: ~1 week
• Analysis time: varied
• Time to perform data analysis: weeks
Sequential analysis lacks the computational, memory, and I/O capabilities required.
A Parallel Data Analysis Model
Decouple data acquisition from data analysis:
• Trajectory definition: specify which frames are to be accessed
• Stage 1: per-frame data acquisition
• Stage 2: cross-frame data analysis
Trajectory Definition
Example: every other frame in the trajectory.
[Figure: the hypothetical trajectory, with every other frame of Ion A and Ion B selected]
Per-frame Data Acquisition (Stage 1)
[Figure: the selected frames divided between processes P0 and P1 for per-frame data acquisition]
Cross-frame Data Analysis (Stage 2)
Analyze Ion A on P0 and Ion B on P1 in parallel.
[Figure: the per-ion time series of Ion A and Ion B, each analyzed on its own process]
Inspiration: Google's MapReduce
[Diagram: input files on the Google File System are processed by map(...) tasks, each emitting key-value pairs such as K1: {v1_i} and K2: {v2_i}; values are grouped by key, e.g. K1: {v1_i, v1_j, v1_k}, and passed to reduce(K1, ...) and reduce(K2, ...), which write the output files]
Trajectory Analysis Cast Into MapReduce
• Per-frame data acquisition (stage 1): map()
• Cross-frame data analysis (stage 2): reduce()
• Key-value pairs connect stage 1 and stage 2
  • Keys: categorical identifiers or names
  • Values: include timestamps
  • Example: key = ion_id_j, value = (t_k, x_jk, y_jk, z_jk)
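As an illustration of this decomposition, here is a plain-Python sketch of the two stages for the ion-permeation example. The function names, the Frame fields, and the advance_state() helper are assumptions made for this sketch; they are not the actual HiMach API.

```python
# Illustrative sketch of the two-stage decomposition in plain Python.

def map_frame(frame):
    """Stage 1: per-frame data acquisition.
    Emits one (key, value) pair per ion of interest, keyed by ion id,
    with the simulated time carried in the value."""
    for ion_id in frame.ions_of_interest:
        x, y, z = frame.positions[ion_id]
        yield ion_id, (frame.time, x, y, z)

def reduce_ion(ion_id, records):
    """Stage 2: cross-frame data analysis for a single ion.
    Records for one key may arrive out of order, so sort by timestamp
    before applying the order-dependent state-machine analysis."""
    crossings = 0
    state = "above_channel"
    for t, x, y, z in sorted(records):
        state, crossed = advance_state(state, (x, y, z))  # hypothetical helper
        crossings += crossed
    return ion_id, crossings
```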
The HiMach Library
• A MapReduce-style API that allows users to write Python programs to analyze MD trajectories
• A parallel runtime that executes HiMach user programs in parallel on a Linux cluster automatically
• Performance results on a Linux cluster: two orders of magnitude faster on 512 cores than on a single core
Typical Simulation–Analysis Storage Infrastructure
[Diagram: a parallel supercomputer writes simulation output through an I/O node to file servers; the analysis cluster's nodes, each with local disks, run the parallel analysis programs against those file servers]
Part II: How We Overcome the I/O Bottleneck in Parallel Analysis
Trajectory Characteristics
• A large number of small frames
• Write once, read many
• Distinguishable by unique integer sequence numbers
• Amenable to out-of-order parallel access in the map phase
Our Main Idea
• At simulation time, actively cache frames in the local disks of the analysis nodes as the frames become available
• At analysis time, fetch data from the local disk caches in parallel
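A minimal sketch of the analysis-time lookup implied by this idea, assuming a hypothetical directory layout (the /bodhi cache directory matches the example slide that follows); this is not the actual Bodhi implementation.

```python
import os

# Minimal sketch of the caching idea; not the actual Bodhi implementation.
# The directory layout below is a hypothetical example.
CACHE_ROOT = "/bodhi"           # local-disk cache on each analysis node
NFS_ROOT = "/nfs/trajectories"  # frames as written to the file servers

def locate_frame(sim_name, frame_name):
    """Prefer the copy cached on local disk at simulation time;
    fall back to the file servers for frames that were not cached."""
    cached = os.path.join(CACHE_ROOT, sim_name, frame_name)
    if os.path.exists(cached):
        return cached            # local-disk read, no network traffic
    return os.path.join(NFS_ROOT, sim_name, frame_name)
```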
Limitations
• Requires large aggregate disk capacity on the analysis cluster
• Assumes a relatively low average simulation data output rate
An Example
[Diagram: frames f0–f3 (sequence numbers 0–3) are stored under sim0 and sim1 on the NFS server; analysis node 0 caches f1 and f3 under /bodhi, and analysis node 1 caches f0 and f2]
Analysis node 0: local bitmap 0 1 0 1, remote bitmap 1 0 1 0, merged bitmap 1 1 1 1
Analysis node 1: local bitmap 1 0 1 0, remote bitmap 0 1 0 1, merged bitmap 1 1 1 1
How do we guarantee that each frame is read by one and only one node in the face of node failure and recovery?
The Zazen Protocol
The Zazen Protocol
• Execute a distributed consensus protocol before performing actual disk I/O
• Assign data retrieval tasks in a location-aware manner
  • Read data from local disks if the data are already cached
  • Fetch missing data from the file servers
• No metadata servers to keep a record of who has what
The Zazen Protocol (cont'd)
• Bitmaps: a compact structure for recording the presence or absence of a cached copy
• All-to-all reduction algorithms: an efficient mechanism for inter-processor collective communication (an MPI library is used in practice)
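A minimal sketch of how these two ingredients could fit together, using mpi4py and NumPy. The "lowest rank wins" tie-breaking rule and the striping of uncached frames across ranks are simplifications for illustration, not necessarily the exact Zazen assignment policy.

```python
import numpy as np
from mpi4py import MPI

def assign_frames(locally_cached, n_frames):
    """locally_cached: frame sequence numbers cached on this node's disk.
    Returns (frames to read from local disk, frames to fetch from file servers)."""
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Local bitmap: one entry per frame, 1 if this node holds a cached copy.
    local = np.zeros(n_frames, dtype=np.uint8)
    local[list(locally_cached)] = 1

    # All-to-all reduction: every node learns the merged bitmap, i.e.
    # which frames are cached somewhere on the cluster.
    merged = np.empty_like(local)
    comm.Allreduce(local, merged, op=MPI.BOR)

    # Location-aware assignment: a cached frame is read by exactly one of
    # the nodes holding it (lowest rank wins in this simplified rule).
    owner = np.where(local == 1, rank, size).astype(np.int64)
    comm.Allreduce(MPI.IN_PLACE, owner, op=MPI.MIN)

    read_local = [f for f in range(n_frames)
                  if merged[f] and owner[f] == rank]
    # Frames cached on no node are striped across ranks and fetched
    # from the file servers.
    fetch_remote = [f for f in range(n_frames)
                    if not merged[f] and f % size == rank]
    return read_local, fetch_remote
```

No metadata server is consulted: each node learns everything it needs from the bitmaps exchanged in the reduction.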
Implementation
• The Bodhi library
• The Bodhi server
• The Zazen protocol
[Diagram: on the Zazen cluster, each analysis node runs parallel analysis programs (HiMach jobs) over the Bodhi library, a Bodhi server, and the Zazen protocol; the I/O node of the parallel supercomputer also links the Bodhi library and writes to the file servers]
Performance Evaluation
Experiment Setup
• A Linux cluster with 100 nodes
• Two Intel Xeon 2.33 GHz quad-core processors per node
• Four 500 GB 7200-RPM SATA disks organized in RAID 0 per node
• 16 GB physical memory per node
• CentOS 4.6 with Linux kernel 2.6.26
• Nodes connected to a Gigabit Ethernet core switch
• Common access to NFS directories exported by a number of enterprise storage servers
Fixed-Problem-Size Scalability
Execution time of the Zazen protocol to assign the I/O tasks for reading 1 billion frames.
[Figure: time (s) vs. number of nodes (1–128)]
Fixed-Cluster-Size Scalability
Execution time of the Zazen protocol on 100 nodes when assigning different numbers of frames.
[Figure: time (s, log scale) vs. number of frames (10^3 to 10^9)]
Efficiency I: Achieving Better I/O Bandwidth
[Figure, two panels: aggregate read bandwidth (GB/s) vs. application read processes per node (1–8), for 2-MB, 64-MB, 256-MB, and 1-GB files; left panel: one Bodhi daemon per analysis node, right panel: one Bodhi daemon per user process]
Efficiency II: Comparison with NFS/PFS
• NFS (v3) on separate enterprise storage servers
  • Dual quad-core 2.8-GHz Opteron processors, 16 GB memory, 48 SATA disks organized in RAID 6
  • Four 1-GigE connections to the core switch of the 100-node cluster
• PVFS2 (2.8.1) on the same 100 analysis nodes
  • I/O (data) server and metadata server on all nodes
  • File I/O performed via the PVFS2 Linux kernel interface
• Hadoop/HDFS (0.19.1) on the same 100 nodes
  • Data stored via HDFS's C library interface, block sizes set equal to file sizes, three replicas per file
  • Data accessed via a read-only Hadoop MapReduce Java program (with a number of best-effort optimizations)
Efficiency II: Outperforming NFS/PFS
I/O bandwidth when reading files of different sizes.
[Figure: aggregate bandwidth (GB/s) for NFS, PVFS2, Hadoop/HDFS, and Zazen at file sizes of 2 MB, 64 MB, 256 MB, and 1 GB]
Efficiency II: Outperforming NFS/PFS
Time to read one terabyte of data.
[Figure: time (s, log scale) for NFS, PVFS2, Hadoop/HDFS, and Zazen at file sizes of 2 MB, 64 MB, 256 MB, and 1 GB]
Read Performance under Writes (1 GB/s)
[Figure: normalized read performance (%) vs. file size for reads (2 MB–1 GB), under write workloads of no writes and 1-GB, 256-MB, 64-MB, and 2-MB files]
End-to-End Performance
• A HiMach analysis program called water-residence, run on 100 nodes
• 2.5 million small frame files (430 KB each)
[Figure: time (s, log scale) vs. application processes per node (1–8) for NFS, Zazen, and memory]
Robustness
• Worst-case execution time is T(1 + δ·B/b)
• The water-residence program was re-executed with varying numbers of nodes powered off
[Figure: theoretical worst case vs. actual running time (s) at node failure rates of 0%–50%]
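One hedged reading of this bound, since the slide does not define its symbols: assume T is the failure-free running time, δ the fraction of failed nodes, B the aggregate read bandwidth of the local-disk caches, b the aggregate bandwidth of the file servers, and D the total data volume. Data cached on failed nodes must be re-fetched from the file servers, which gives:

```latex
% Assumed symbol meanings (not defined on the slide): T = D/B is the
% failure-free time, \delta the node failure rate, B the local-cache
% bandwidth, b the file-server bandwidth, D the total data volume.
\[
  T_{\text{worst}} \approx \frac{D}{B} + \frac{\delta D}{b}
                   = T\left(1 + \delta\,\frac{B}{b}\right)
\]
```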
Summary
Zazen accelerates order-independent, parallel data access by (1) actively caching simulation output, and (2) executing an efficient distributed consensus protocol.
• Simple and robust
• Scalable to a large number of nodes
• Much higher performance than NFS/PFS
• Applicable to a certain class of time-dependent simulation datasets