Large-Scale Adaptive Mesh Simulations Through Non-Volatile Byte-Addressable Memory
Bao Nguyen, Hua Tan, Xuechen Zhang, Kei Davis
Octree Meshing is Widely Used in HPC Simulation
• Droplet breakup
• Micro-boiling
• Droplet ejection
Quad/Octree-Based Adaptive Meshing
[Figure: domain decomposition and its quad/octree representation in DRAM]
Because models span larger length and time scales, DRAM demand is significant even on supercomputers.
Per-core DRAM Capacity is Shrinking on Supercomputers
• Jaguar: 2.7-4 GB/core
• Titan: 2 GB/core
due to the associated capital costs and power consumption.
Using Non-Volatile Byte-Addressable Memory for Meshing

         Non-Volatility  Byte-Addressability  Speed  Cost        Power
  Flash  Yes             No                   Low    Decreasing  Low
  DRAM   No              Yes                  High   Increasing  High
  NVBM   Yes             Yes                  High*  Decreasing  Low
Existing Applications Were Not Designed for NVBM
• In-core algorithms: linear octree [SC'07], parallel octree [SC'05], etc. But they save snapshots on storage systems for failure recovery; I/O can be the bottleneck.
• Out-of-core algorithms: Etree [SC'04], visualization [TVCG'97], etc. But they were designed for slow non-volatile media, e.g., SSDs and HDDs.
Can we support in-NVBM octree meshing, bypassing slow I/O buses?
Challenge I: NVBM Writes Incur Higher Latency
• NVBM write latency is 2.5X that of DRAM.
• Meshing operations (e.g., refinement) are write-intensive.
Challenge II: Existing Octrees Are Not Durable on NVBM
[Figure: the octree after a normal pointer write vs. after a failed pointer write]
A failure may leave a pointer linking to an undefined region in NVBM.
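The durability problem on this slide can be sketched in a few lines of C. This is an illustrative sketch, not the paper's actual code: the names (`octant`, `persist`, `link_octant`) are hypothetical, and `persist()` stands in for a cache-line flush plus fence (e.g., CLFLUSH/SFENCE on x86 NVBM hardware). The key ordering is that the new octant is made durable before the single atomic pointer store publishes it, so a crash at any point leaves either no new octant or a complete one, never a pointer into an undefined NVBM region.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct octant octant;
struct octant {
    int id;
    octant *_Atomic child;
};

/* Stand-in for flushing cache lines to NVBM and fencing;
 * a no-op here, real on persistent-memory hardware. */
static void persist(const void *addr, size_t len) {
    (void)addr; (void)len;
}

/* Durable link update:
 * 1: the new octant's payload is persisted first,
 * 2: a single 8-byte atomic store makes it reachable,
 * 3: the pointer itself is persisted. */
void link_octant(octant *parent, octant *node) {
    persist(node, sizeof *node);
    atomic_store(&parent->child, node);
    persist(&parent->child, sizeof(octant *));
}
```

A failed write in the middle of step 1 is invisible (the octant is unreachable); a failure after step 2 but before step 3 is resolved by the atomicity of the pointer store.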
Challenge III: Special Pointers Are Difficult to Handle
[Figure: special pointers crossing the DRAM/NVBM boundary of the tree]
Handling special pointers that cross the DRAM/NVBM boundary introduces extra complexity for application developers.
Design Objectives of the Persistent-Merged Octree
In-NVBM meshing & storage + hiding NVBM write latency + orthogonal persistence = persistent-merged octree (PM-octree)
PM-Octree Design: A Multi-Version Data Structure
• Version V_{i-1}: persistent, in NVBM
• Version V_i: volatile, in DRAM + NVBM
The persistent version provides the desired durability.
PM-Octree Design: Octant Sharing between Versions
Observation: many spatial domains do not change in adjacent time steps, so V_{i-1} and V_i can share octants of the C1 tree in NVBM.
Sharing reduces memory usage by up to 1.9X.
PM-Octree Design: Partitioned Data Structure
[Figure: version V_i split into a C0 tree in DRAM and a C1 tree in NVBM]
Partitioning effectively uses both DRAM and NVBM.
PM-Octree Design: Dynamic Layout Transformation
[Figure: subtrees moved between DRAM and NVBM as the layout is transformed]
Layout transformation is executed periodically to hide NVBM write latency.
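One idea behind a layout transformation like this can be sketched as follows. The sketch is hypothetical (the names `tnode` and `flatten_bfs` are not from the paper): octants scattered through DRAM are packed breadth-first into one contiguous buffer, so the subsequent copy to NVBM becomes a single sequential write instead of many small random writes, which helps hide NVBM's higher write latency.

```c
#include <stddef.h>

#define NCHILD 8     /* octree fan-out */
#define MAXN   64    /* queue capacity for this sketch */

typedef struct tnode {
    int id;
    struct tnode *child[NCHILD];
} tnode;

/* Pack the subtree rooted at 'root' into 'out' in breadth-first
 * order; returns the number of octants packed. The resulting
 * contiguous buffer can be written to NVBM in one sequential pass. */
int flatten_bfs(tnode *root, int out[], int max) {
    tnode *queue[MAXN];
    int head = 0, tail = 0, n = 0;
    queue[tail++] = root;
    while (head < tail && n < max) {
        tnode *cur = queue[head++];
        out[n++] = cur->id;                  /* packed, sequential order */
        for (int i = 0; i < NCHILD; i++)
            if (cur->child[i] && tail < MAXN)
                queue[tail++] = cur->child[i];
    }
    return n;
}
```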
Putting Together the Components of PM-Octree
PM-octree is a multi-version data structure for both in-memory meshing and storage. It provides near-instantaneous failure recovery because the persistent version is accessed over the memory bus.
Basic Operation: Octant Insertion
[Figure: the tree before and after inserting octant 11 -- a new root R', node u', and copy 9' are created for V_i, while unchanged octants are shared with V_{i-1}]
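The copy-on-write insertion pictured above can be sketched in C. This is a simplified, hypothetical sketch (names `onode` and `insert_cow` are not from the paper, and the path is given explicitly as child indices): only the nodes on the root-to-target path are duplicated into the new version V_i; every other subtree is shared with V_{i-1}, which stays intact as the recoverable persistent version.

```c
#include <stdlib.h>
#include <string.h>

#define NCHILD 8

typedef struct onode {
    int id;
    struct onode *child[NCHILD];
} onode;

/* Insert 'leaf' at the end of 'path' (a list of child indices of
 * length 'depth'), copying only the nodes along that path.
 * Returns the new root R' of version V_i; the old root (V_{i-1})
 * is left untouched. */
onode *insert_cow(const onode *root, const int *path, int depth, onode *leaf) {
    onode *copy = malloc(sizeof *copy);
    memcpy(copy, root, sizeof *copy);       /* untouched subtrees are shared */
    if (depth == 1)
        copy->child[path[0]] = leaf;        /* attach the new octant */
    else
        copy->child[path[0]] =
            insert_cow(root->child[path[0]], path + 1, depth - 1, leaf);
    return copy;
}
```

This mirrors the slide: inserting octant 11 yields R' and u' plus the shared, unmodified remainder of the previous version.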
Basic Operation: Octant Update
[Figure: the tree before and after updating octant 10 -- V_i receives a copy 10' along the copied path, while V_{i-1} remains unchanged]
PM-Octree Design: Orthogonal Persistence

  Routine                               Description
  pmoctree *pm_create(octree *tree)     create a new PM-octree; return a pointer to V_i
  void pm_persistent(pmoctree *tree)    create a persistent version of the octree
  pmoctree *pm_restore(void)            restore a PM-octree; return a pointer to V_i
  void pm_delete(pmoctree *tree)        delete all octants on NVBM and DRAM

We integrated it with the Gerris flow solver.
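The four routines in the table can be driven as in the sketch below. The stub bodies are hypothetical placeholders so the usage pattern compiles and runs (the real implementations live in the PM-octree library); here "NVBM" is a static variable and the octree payload is reduced to an element count. The point is the calling pattern: mesh, persist each step, then recover after a failure without touching the file system.

```c
#include <stdlib.h>

/* Hypothetical minimal stand-ins for the PM-octree types and API. */
typedef struct octree   { int nelem; } octree;
typedef struct pmoctree { octree vi; } pmoctree;

static octree nvbm_snapshot;            /* stand-in for the persistent version */

pmoctree *pm_create(octree *t) {        /* new PM-octree; returns pointer to V_i */
    pmoctree *p = malloc(sizeof *p);
    p->vi = *t;
    return p;
}
void pm_persistent(pmoctree *t) {       /* make the current version persistent */
    nvbm_snapshot = t->vi;
}
pmoctree *pm_restore(void) {            /* recover V_i over the memory bus */
    pmoctree *p = malloc(sizeof *p);
    p->vi = nvbm_snapshot;
    return p;
}
void pm_delete(pmoctree *t) { free(t); }

/* Usage pattern as a flow solver might drive it: refine, persist each
 * time step, then recover after a simulated failure. */
int run_and_recover(void) {
    octree mesh = { .nelem = 1 };
    pmoctree *pm = pm_create(&mesh);
    for (int step = 0; step < 3; step++) {
        pm->vi.nelem *= 2;              /* refinement grows the mesh */
        pm_persistent(pm);              /* durable version every step */
    }
    pm_delete(pm);                      /* "failure": volatile state is gone */
    pmoctree *back = pm_restore();      /* near-instant recovery */
    int n = back->vi.nelem;
    pm_delete(back);
    return n;
}
```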
Experimental Setting
• Hardware
  Ø Titan at ORNL
  Ø Emulation of NVBM using DRAM

                        DRAM   NVBM
    Read latency (ns)     60    100
    Write latency (ns)    60    150

• Simulation: droplet rotation and ejection
Comparison of Meshing Methods

  Method name          Objects in DRAM   Objects in NVBM   Interface
  In-core-octree       Octants           Snapshot          File system
  Out-of-core-octree   Octant cache      Record            File system
  PM-octree            Octants           Octants           Memory
Weak Scaling
• 1.2M to 1077M elements
• 1 to 1000 PEs
• Number of elements per PE: ~1 million
The execution time of PM-octree increases as the logarithm of the problem size.
Execution Time Breakdown with Weak Scaling
Tree-partitioning overhead prevents PM-octree from achieving an optimal speedup.
Strong Scaling
• Problem size: 150 million elements
• 240 to 1000 PEs
The scalability of PM-octree is similar to that of in-core-octree.
Execution Time Breakdown with Strong Scaling
No scalability issue: no major fluctuation is observed across PE counts.
Failure Recovery
• PM-octree reduces failure recovery time by up to 20X.
• PM-octree guarantees data consistency after failures.
Conclusions
• PM-octree effectively extends memory capacity using NVBM.
• It scales as well as in-core algorithms.
• It significantly reduces recovery time.
• It provides an easy-to-program interface.
Acknowledgments
Xuechen Zhang (xuechen.zhang@wsu.edu), Bao Nguyen, Hua Tan
Basic Operation: Octant Merging
[Figure: before and after merging the C0 subtree in DRAM into the C1 subtree in NVBM]
Basic Operation: Making a Version Persistent
[Figure: versions before and after the persist operation -- V_i becomes the persistent version and a new volatile version V_{i+1} with root R' is created]
Dynamic Layout Transformation
Execution time is reduced by 25% while the number of NVBM writes is reduced by up to 30%.
Impact of DRAM Size
Varying the DRAM size influences the merging frequency and the execution time.