Resilient Distributed Concurrent Collections, Cédric Bassem

Resilient Distributed Concurrent Collections Cédric Bassem Promotor: Prof. Dr. Wolfgang De Meuter Advisor: Dr. Yves Vandriessche 1

Evolution of Performance in High Performance Computing Exascale = 10^18 Flop/s Petascale = 10^15 Flop/s (source: http://www.top500.org/statistics/perfdevel/) 2

Evolution of Failures in HPC Main source: hardware faults (~50%) SMTTI = System Mean Time To Interrupt In Exascale: SMTTI < 30 min Source: Cappello (2009) 3

Resilience Resilience = Fault Tolerance Avizienis et al. (2004) “The collection of techniques for keeping applications running to a correct solution in a timely and efficient manner despite underlying system faults” Snir et al. (2014) 4

Coordinated Checkpoint/Restart 5

Asynchronous Checkpoint/Restart 6

Requirements for Asynchronous Checkpoint/Restart Reasoning about state: Self-aware, execution frontier Safe restart: Deterministic computation Data race free: Monotonically increasing state 7

Resilience in CnC Vrvilo, N. (2014). Asynchronous Checkpoint/Restart for the Concurrent Collections Model (Unpublished master’s thesis). Rice University, Houston, Texas, USA. Focused on shared memory CnC runtimes CnC Properties: ● Dependency graph ● Provably deterministic computation ● Single assignment data 8

The Concurrent Collections Model [Figure sequence over eight slides: a Fibonacci CnC graph shown next to a checkpoint. Labels: env, Tags (0, 1, 2), Fibs, Results (0:0, 1:1, 2:1), Checkpoint.] 9-16

Proof of Concept Implementation Goal: Assessing the viability of Asynchronous C/R in distributed memory CnC runtimes Runtime: Intel(R) Concurrent Collections for C++ (Architect: Frank Schlimbach) Resilience Flavour: ● Dedicated checkpoint node ● Fine-grained updates ● Uncoordinated restart 17

Dedicated Checkpoint Node & Fine-grained Updates Updates contain: ● data instances consumed ● data instances produced ● control instances produced [Figure: producer and consumer nodes sending updates to the checkpoint node] 18

Restart Restart simulation ➜ No fault-tolerant MPI Uncoordinated ➜ Step duplication [Figure: restart shown in four numbered stages across the nodes] 19
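The update contents and the uncoordinated restart described on the two slides above can be sketched as follows. This is a minimal illustration, not the proof-of-concept implementation: `StepUpdate`, `Checkpoint`, `apply`, and `restart_set` are hypothetical names, and defining the restart set as "control instances seen whose step has not reported completion" is an assumption made for the example.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical update sent to the checkpoint node when a step instance finishes.
struct StepUpdate {
    int step_tag;                                // tag of the finished step instance
    std::vector<int> consumed;                   // keys of data instances it read
                                                 // (kept for get-count bookkeeping, unused here)
    std::vector<std::pair<int, long>> produced;  // (key, value) data instances it wrote
    std::vector<int> prescribed;                 // control instances (tags) it put
};

// Hypothetical checkpoint state held by the dedicated checkpoint node.
struct Checkpoint {
    std::map<int, long> data;   // single-assignment data instances
    std::set<int> tags;         // every control instance seen so far
    std::set<int> finished;     // step instances known to have completed

    void apply(const StepUpdate& u) {
        // emplace never overwrites, so a duplicated step cannot change a value.
        for (const auto& kv : u.produced) data.emplace(kv.first, kv.second);
        tags.insert(u.prescribed.begin(), u.prescribed.end());
        finished.insert(u.step_tag);
    }

    // Assumed definition: tags that were put but whose step never reported back.
    std::vector<int> restart_set() const {
        std::vector<int> pending;
        for (int t : tags)
            if (finished.count(t) == 0) pending.push_back(t);
        return pending;
    }
};
```

Because data instances are single assignment and steps are deterministic, re-running a step whose update was already applied leaves the checkpoint unchanged, which is why step duplication after an uncoordinated restart is tolerable in this sketch.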

Memory Management in CnC Non-trivial: data accessed by dynamic steps One solution: get-counting method int getCountFib( FibTag t ) { if ( t > 0 ) { return 2; } else { return 1; } } 20
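To make the get-counting idea concrete, here is a small self-contained sketch. `getCountFib` follows the slide; `ItemStore` and its methods are hypothetical and are not the Intel CnC API, and `FibTag` is assumed to be a plain `int`.

```cpp
#include <cassert>
#include <map>

using FibTag = int;  // assumption: the slide does not define FibTag

// From the slide: how many steps will read the Fibonacci item with tag t.
int getCountFib(FibTag t) {
    if (t > 0) {
        return 2;
    } else {
        return 1;
    }
}

// Hypothetical item store that frees an item after its last expected get.
class ItemStore {
public:
    void put(FibTag t, long value) {
        items_.emplace(t, Entry{value, getCountFib(t)});
    }

    long get(FibTag t) {
        auto it = items_.find(t);
        assert(it != items_.end());  // precondition: item is present
        long value = it->second.value;
        if (--it->second.remaining == 0) items_.erase(it);  // last reader: release
        return value;
    }

    bool contains(FibTag t) const { return items_.count(t) != 0; }

private:
    struct Entry {
        long value;
        int remaining;  // gets still expected for this item
    };
    std::map<FibTag, Entry> items_;
};
```

In this sketch the item for tag 1 survives its first `get` and is released on the second, while the item for tag 0 is released after a single `get`.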

Solution Extra bookkeeping in checkpoint: ➢ Consider steps only once when lowering get counts ○ Hashmap of considered steps ➢ Never re-add removed data instances ○ Marking data as removed 21
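The two bookkeeping rules above might look like this in code. It is a sketch under assumptions: `CheckpointStore` and its methods are hypothetical names, a hash set stands in for the slide's "hashmap of considered steps", and removed data is marked by keeping its key in a second set.

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical checkpoint-side store with get counts.
class CheckpointStore {
public:
    // Stores a data instance unless it is already present or was removed earlier.
    // Returns true only if the instance was actually stored.
    bool put(int key, long value, int get_count) {
        if (removed_.count(key) != 0 || items_.count(key) != 0) return false;
        items_.emplace(key, Entry{value, get_count});
        return true;
    }

    // Lowers the get counts of the items a step consumed.
    // A step that was already considered (e.g. a duplicate after restart) is ignored.
    bool step_consumed(int step_tag, const std::vector<int>& keys) {
        if (!considered_.insert(step_tag).second) return false;
        for (int k : keys) {
            auto it = items_.find(k);
            if (it == items_.end()) continue;
            if (--it->second.remaining <= 0) {
                items_.erase(it);
                removed_.insert(k);  // mark as removed so it is never re-added
            }
        }
        return true;
    }

    bool contains(int key) const { return items_.count(key) != 0; }

private:
    struct Entry {
        long value;
        int remaining;
    };
    std::unordered_map<int, Entry> items_;
    std::unordered_set<int> considered_;  // steps whose gets were already counted
    std::unordered_set<int> removed_;     // keys of data instances already freed
};
```

Without `considered_`, a duplicated step would lower a get count twice and free data too early; without `removed_`, the same duplicated step could put back an instance that no remaining step will ever read.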

Modelling Overhead (Tw/Ts) Coordinated Checkpoint/Restart (Daly, 2006) Asynchronous Checkpoint/Restart 22
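The formulas on this slide did not survive extraction. For the coordinated case, the commonly cited wall clock model from Daly (2006) is Tw = M e^(R/M) (e^((τ+δ)/M) - 1) Ts/τ, where τ is the compute time between checkpoints, δ the checkpoint cost, R the restart cost, and M the mean time to interrupt. Whether the slide used exactly this form is an assumption, and the asynchronous model is not reconstructed here. The function names below are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Expected wall clock time under coordinated checkpoint/restart (Daly, 2006).
//   Ts    : solve time without failures or checkpoints
//   tau   : compute time between two checkpoints
//   delta : time to write one checkpoint
//   R     : time to restart after an interrupt
//   M     : mean time to interrupt
double daly_wall_time(double Ts, double tau, double delta, double R, double M) {
    assert(tau > 0.0 && M > 0.0);
    // expm1(x) computes e^x - 1 accurately for small x.
    return M * std::exp(R / M) * std::expm1((tau + delta) / M) * Ts / tau;
}

// Overhead factor Tw/Ts; the solve time cancels out of the ratio.
double daly_overhead_factor(double tau, double delta, double R, double M) {
    return daly_wall_time(1.0, tau, delta, R, M);
}
```

As M shrinks towards the checkpoint interval, the exponential terms dominate and the overhead factor grows quickly, which is the regime the deck targets with SMTTI under 30 minutes.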

Evaluating Asynchronous Checkpoint/Restart 23

Benchmarks - Goals Assessing overhead factor (φ): Ok if high Method: Measure w/o resilience = Solve time (Ts) Measure with resilience = Wall clock time (Tw) Overhead factor = Tw/Ts Assessing restart time (Tr): Should be low Method: Measure time needed to calculate the restart set 24
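The measurement method above amounts to timing the same workload twice and taking the ratio. A minimal sketch, assuming the workload can be wrapped in a callable; `seconds_to_run` and `overhead_factor` are hypothetical helper names, not part of the benchmark code.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed time in seconds.
template <typename F>
double seconds_to_run(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Overhead factor phi = Tw / Ts, with Tw measured with resilience enabled
// and Ts measured without it.
double overhead_factor(double tw_seconds, double ts_seconds) {
    assert(ts_seconds > 0.0);
    return tw_seconds / ts_seconds;
}
```

Usage would be along the lines of `overhead_factor(seconds_to_run(run_with_resilience), seconds_to_run(run_without_resilience))`, where both callables are placeholders for the actual benchmark runs. The same timer can wrap the restart-set calculation to obtain Tr.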

Number of Steps [Plots: Fibonacci, Mandelbrot] Overhead factor (φ): Increases with number of steps 25

Restart Time Restart Time (Tr): Low Optimization: Shifting some of the complexity to the overhead factor [Plot: Fibonacci, restart time] 26

Future Work Distributed Checkpoint: ➢ Overhead high but constant ➢ Restart time? Tag-only logging: ➢ Less communication ➢ Complex restart 27

Conclusion Asynchronous C/R in a distributed memory CnC runtime ➢ Analyzing different cases ➢ Proof of concept implementation Asynchronous C/R is viable for systems with low SMTTI ➢ Model ➢ Proof of concept implementation 28

References Daly, J. T. (2006). A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps. Future Generation Computer Systems, 22(3), 303–312. Avizienis, A., Laprie, J., Randell, B., & Landwehr, C. E. (2004). Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33. Snir, M., Wisniewski, R. W., Abraham, J. A., Adve, S. V., Bagchi, S., Balaji, P., . . . Hensbergen, E. V. (2014). Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications, 28(2), 129–173. Cappello, F. (2009). Fault Tolerance in Petascale/Exascale Systems: Current Knowledge. International Journal of High Performance Computing, 23(1), 212–226. Vrvilo, N. (2014). Asynchronous Checkpoint/Restart for the Concurrent Collections Model (Unpublished master’s thesis). Rice University, Houston, Texas, USA. 29
