
A Self-correcting Graph Connected Component Algorithm



  1. A Self-correcting Graph Connected Component Algorithm
     Piyush Sao, Oded Green, Chirag Jain, Richard Vuduc
     The 6th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2016
     http://hpcgarage.org/ftxs16/
     Running footer: Piyush Sao, Fault tolerant graph computing, FTXS16

  2. Summary: Summary of Contributions
     Self-correcting algorithms. We introduce a new fault-tolerant algorithm design principle that we call self-correction. A self-correcting algorithm remains in a valid state despite the faulty execution of an iteration, under the assumption that its previous state was valid.
     Self-correcting connected components algorithm. We apply the ideas of self-correction to the label propagation algorithm for the graph connected components problem. The algorithm assumes the availability of a selective reliability mode. It requires O(|V|) additional storage and computation per iteration, compared to the O(|V| + |E|) cost of the baseline algorithm. Execution time increases by 10-35% for one error per 64 memory operations.

  4. Self-correcting Algorithms: Iterative Algorithms
     A typical iterative algorithm has the following components: (1) the input problem; (2) intermediate variables; (3) an update relation; (4) convergence checking; and (5) the solution.
     [Flowchart: Input problem (I) → Initialize S_0 → Update S_k: S_{k+1} = F(S_k), k = k + 1 → Converged? No: update again. Yes: report solution.]
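
The flowchart on this slide can be written as a short generic loop. This is only a minimal sketch: the function name `iterate`, its parameters, and the fixed-point problem x = cos(x) used as the input problem are illustrative assumptions and do not come from the talk.

```python
import math
from typing import Callable, Tuple, TypeVar

State = TypeVar("State")


def iterate(
    initialize: Callable[[], State],
    update: Callable[[State], State],
    converged: Callable[[State, State], bool],
    max_iters: int = 1000,
) -> Tuple[State, int]:
    """Run S_{k+1} = F(S_k) from S_0 until the convergence check passes."""
    state = initialize()                   # Initialize S_0
    for k in range(max_iters):
        next_state = update(state)         # S_{k+1} = F(S_k)
        if converged(state, next_state):   # Converged? Yes: report solution
            return next_state, k + 1
        state = next_state                 # No: k = k + 1 and update again
    raise RuntimeError("no convergence within max_iters iterations")


# Hypothetical input problem (not from the slides): find x with x = cos(x).
solution, iterations = iterate(
    initialize=lambda: 1.0,
    update=math.cos,
    converged=lambda old, new: abs(new - old) < 1e-12,
)
print(solution, iterations)
```

Here the state is a single float; in the connected components setting the state would be the per-vertex label array.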

  5. Self-correcting Algorithms: Iterative Algorithm as State Machine
     An iterative algorithm can be viewed as a state machine. The state of the algorithm is the subset of intermediate variables required for continued execution of the algorithm. The algorithm starts with an initial state S_0 and uses the update relation to transition from one state to the next: S_{k+1} ← S_k.
     [Diagram: chain of states S_0 → S_1 → S_2 → ... → S_n.]

  9. Self-correcting Algorithms: Single Fault in Iterative Algorithm
     Valid and invalid states. A state is valid if fault-free execution of the algorithm from that state converges to the correct solution; otherwise it is invalid. In fault-free execution, the algorithm always remains in a valid state. Any hardware fault can cause the algorithm to reach an invalid state. In general, determining whether a given state is valid is non-trivial.
     [Diagram: chain S_0 → S_1 → S_2 → ... → S_n, with a fault after S_1 producing a faulty state S_f in place of S_2. Successive builds label S_f first as a valid state and then as an invalid state; in the invalid case the chain ends in a different final state S_fn.]

  12. Self-correcting Algorithms: Self-stabilizing Algorithms
     Starting from any arbitrary state, valid or invalid, a self-stabilizing algorithm will reach a valid state in a finite number of steps. This is a natural fault-tolerance mechanism. Examples: stationary iterations, Newton iteration.
     Self-stabilization is a strong property. Scala'13: self-stabilizing steepest descent and conjugate gradient, using periodic state correction. It may not generalize to all iterative algorithms.
     [Diagram: Arbitrary State → Valid State → Solution State.]
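
Newton iteration is listed above as a self-stabilizing example. The sketch below illustrates this for one simple case, Newton's method for the square root of a positive number, where any positive iterate leads back to the solution. The function `newton_sqrt`, its parameters, and the fault-injection mechanism are assumptions made for illustration; the restriction to positive fault values is also an assumption, since a zero or negative iterate would break this particular iteration.

```python
from typing import Optional, Tuple


def newton_sqrt(
    a: float,
    x0: float = 1.0,
    fault_at: Optional[int] = None,
    fault_value: Optional[float] = None,
    tol: float = 1e-12,
    max_iters: int = 200,
) -> Tuple[float, int]:
    """Newton iteration x <- (x + a / x) / 2 with an optional injected fault."""
    x = x0
    for k in range(max_iters):
        if fault_at is not None and k == fault_at:
            x = fault_value               # injected fault: the state is overwritten
        x_next = 0.5 * (x + a / x)        # one Newton step
        if abs(x_next - x) <= tol * x_next:
            return x_next, k + 1
        x = x_next
    raise RuntimeError("no convergence within max_iters iterations")


clean, n_clean = newton_sqrt(2.0)
faulty, n_faulty = newton_sqrt(2.0, fault_at=3, fault_value=1e6)
print(clean, n_clean)
print(faulty, n_faulty)
```

Both runs end at the same solution; the faulty run only needs more iterations, which is the sense in which the fault is absorbed without any explicit detection or recovery step.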

  13. Self-correcting Algorithms: Self-Correcting Algorithms
     A self-correcting algorithm is an iterative algorithm that, starting in some valid state, remains in a valid state or returns to a valid state in a finite number of steps even if a fault occurs. It requires that the algorithm starts from a valid state, and it uses information from a previously known valid state. Examples: checkpoint-restart, FT-GMRES.
     [Diagram: chain S_0 → S_1 → S_2 → ... → S_n; a fault after S_1 leads to an arbitrary state, from which execution returns to a valid state.]

  14. Self-correcting Algorithms: Checkpoint-restart as a Self-correcting Algorithm
     Checkpoint-restart based fault tolerance brings the algorithm back to a valid state by restoring a checkpointed valid state. At a high fault rate, the algorithm will not make any progress. The broader idea of self-correction is to use S_1 to construct a state S̃_2 ≈ S_2.
     [Diagram: chain S_0 → S_1 → S_2 → ... → S_n; a fault after S_1 leads to an arbitrary state, and a restart returns execution to S_1.]
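
The checkpoint-restart scheme above can be sketched as a loop that keeps the last accepted state and falls back to it when a step looks corrupted. Everything here is an assumption made for illustration: the name `run_with_checkpoint`, the `is_plausible` detector, and the dictionary of injected faults are hypothetical, and real systems rarely have such a cheap and reliable fault detector.

```python
import math
from typing import Callable, Dict, Tuple


def run_with_checkpoint(
    update: Callable[[float], float],
    is_plausible: Callable[[float], bool],
    initial_state: float,
    steps: int,
    faults: Dict[int, float],
) -> Tuple[float, int]:
    """Advance `steps` iterations, restoring the checkpoint after a detected fault."""
    pending = dict(faults)        # iteration index -> corrupted value (hypothetical)
    state = initial_state
    checkpoint = state            # last state known to be valid
    restarts = 0
    k = 0
    while k < steps:
        candidate = update(state)
        if k in pending:
            candidate = pending.pop(k)    # injected fault corrupts this step once
        if is_plausible(candidate):
            state = candidate
            checkpoint = state            # checkpoint after every accepted step
            k += 1
        else:
            state = checkpoint            # restart from the checkpointed state
            restarts += 1
    return state, restarts


result, restarts = run_with_checkpoint(
    update=lambda x: 0.5 * (x + 2.0 / x),  # Newton step for sqrt(2)
    is_plausible=math.isfinite,
    initial_state=1.0,
    steps=10,
    faults={3: float("nan")},
)
print(result, restarts)
```

Each detected fault costs one repeated iteration, so if almost every step were corrupted the counter `k` would rarely advance, which matches the remark that the algorithm makes no progress at a high fault rate.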

  17. Self-correcting Label Correction Algorithm: Label Propagation Algorithm for Graph Connected Component Algorithm
     Graph connected component problem. We seek the number of connected components in the graph and the connected component to which each vertex belongs. It is used for community detection, centrality analytics, and streaming graph analytics. Label propagation is highly suited for parallel computing.
     [Figure: example graph with vertices labeled 0 through 8.]
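
A minimal sequential sketch of minimum-label propagation for connected components follows. It is the common textbook formulation and not necessarily the exact variant used in the talk. The edge list is hypothetical: the slide shows a graph on vertices 0 through 8, but its edges are not recoverable from the text.

```python
from typing import List, Sequence, Tuple


def label_propagation(num_vertices: int, edges: Sequence[Tuple[int, int]]) -> List[int]:
    """Return one label per vertex; vertices share a label iff they are connected."""
    labels = list(range(num_vertices))    # initial state: each vertex is its own label
    changed = True
    while changed:                        # iterate until no label changes
        changed = False
        for u, v in edges:
            smallest = min(labels[u], labels[v])
            if labels[u] != smallest or labels[v] != smallest:
                labels[u] = labels[v] = smallest   # propagate the smaller label
                changed = True
    return labels


# Hypothetical edges on 9 vertices (assumed for illustration only).
edges = [(0, 2), (2, 3), (3, 8), (1, 6), (6, 7), (5, 4)]
labels = label_propagation(9, edges)
num_components = len(set(labels))
print(labels, num_components)
```

On convergence every vertex carries the smallest vertex ID in its component, so the number of distinct labels is the number of components. The per-edge updates are independent apart from the shared label array, which is one reason the method parallelizes well.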
