Self-stabilizing Iterative Solvers Piyush Sao, Richard Vuduc School of Computational Science & Engineering Georgia Institute of Technology SIAM PP-14 P. Sao, R. Vuduc (Georgia Tech) Self-stabilizing Iterative Solvers SIAM PP-14 1 / 21
Introduction: Self-stabilization
Informally, self-stabilization (Dijkstra 1974) is a property of a system that guarantees it will enter a valid state no matter what its initial state is. We describe a self-stabilizing version of the conjugate gradient method, which is resilient to transient soft faults.
Introduction: Fault-tolerant Iterative Algorithms
Can an iterative algorithm still converge if a fault has occurred?
Introduction: Iterative Algorithms
[Figure: flowchart of an iterative algorithm. Start with state <x_0, y_0, z_0, ...>; an Update step, using intermediate variables, maps <x_k, y_k, z_k, ...> to <x_{k+1}, y_{k+1}, z_{k+1}, ...>; Check for convergence; Return <x_s, y_s, z_s, ...>.]
Introduction: Self-stabilizing Algorithms
An algorithm is self-stabilizing if, starting from any state (valid or invalid), it returns to a valid state within a finite number of "steps"; otherwise it is not.
[Figure: state-space diagram with regions for invalid states, valid states, and solution states. Labeled states: Start <Xs, Ys, Zs, ...>, <X1, Y1, Z1, ...>, <Xk, Yk, Zk, ...>, <Xi, Yi, Zi, ...>, <Xs', Ys', Zs', ...>, and <Xf, Yf, Zf, ...>; a faulty execution is marked.]
Introduction: Making an Algorithm Self-stabilizing
- Naturally self-stabilizing (e.g., Newton, SOR, Jacobi)
- Restart from a checkpoint
- Restart (such as restarted GMRES)
- Our strategy: correction step
Introduction: Periodic correction step
- Restores sufficient conditions for convergence
- Mathematically "equivalent" to the original algorithm in a fault-free execution
- Eliminates the need for detecting faults
- Executing the correction step periodically ensures that correct behavior resumes within a finite number of steps
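To illustrate the first category, here is a minimal sketch (not from the talk) of Jacobi iteration on a small strictly diagonally dominant system. The matrix, the right-hand side, the helper names `jacobi_step` and `run_jacobi`, and the fault model (overwriting the iterate with large garbage values) are all assumptions chosen for illustration. Because Jacobi converges from any starting vector on such a system, the corrupted run should reach the same solution as the clean one.

```python
def jacobi_step(A, b, x):
    """One Jacobi sweep: x_i <- (b_i - sum_{j != i} a_ij * x_j) / a_ii."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def run_jacobi(A, b, iters, fault_at=None):
    """Run a fixed number of sweeps, optionally corrupting the state once."""
    x = [0.0] * len(b)
    for k in range(iters):
        if k == fault_at:
            # Hypothetical transient fault: the whole iterate becomes garbage.
            x = [1.0e6 * (i + 1) for i in range(len(b))]
        x = jacobi_step(A, b, x)
    return x


# Strictly diagonally dominant, so Jacobi converges from any start vector.
A = [[4.0, 1.0, 0.0],
     [1.0, 5.0, 2.0],
     [0.0, 2.0, 6.0]]
b = [5.0, 8.0, 8.0]  # chosen so that the exact solution is [1, 1, 1]

clean = run_jacobi(A, b, iters=200)
faulty = run_jacobi(A, b, iters=200, fault_at=20)
```

No fault detection or correction is needed here: the iteration simply treats the corrupted vector as a new starting point.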
Introduction: Self-stabilizing Conjugate Gradient: Conjugate Gradient Algorithm
- Solve $Ax = b$ for $x$, for SPD $A$
- Equivalent quadratic optimization problem: $F(x) = \frac{1}{2} x^T A x - x^T b$
- $F(x)$ represents an $N$-dimensional paraboloid
- CG finds the optimum by taking appropriately constructed steps
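For reference, the link between the linear system and the quadratic follows from a short gradient computation that uses only the symmetry and positive definiteness of $A$. This is the standard derivation, added here as a reading aid rather than taken from the slides.

```latex
\begin{align*}
F(x) &= \tfrac{1}{2}\, x^T A x - x^T b, \\
\nabla F(x) &= A x - b, \\
\nabla F(x^\ast) = 0 &\iff A x^\ast = b, \\
r &= b - A x = -\nabla F(x).
\end{align*}
```

Because the Hessian $\nabla^2 F = A$ is positive definite, the stationary point $x^\ast$ is the unique minimizer, and the residual $r$ points along the direction of steepest descent.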
Introduction: Self-stabilizing Conjugate Gradient: Conjugate Gradient (CG) Algorithm
State variables:
- $x_k$ = present estimate
- $p_k$ = search direction
- $r_k = b - A x_k$ = direction of steepest descent
Transition function:
1. $q_k \leftarrow A p_k$
2. $\alpha_k \leftarrow \dfrac{r_k^T r_k}{p_k^T q_k}$
3. $x_{k+1} \leftarrow x_k + \alpha_k p_k$
4. $r_{k+1} \leftarrow r_k - \alpha_k q_k$
5. $\beta_k \leftarrow \dfrac{\|r_{k+1}\|^2}{\|r_k\|^2}$
6. $p_{k+1} \leftarrow r_{k+1} + \beta_k p_k$
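The six-step transition function can be sketched in plain Python as follows. This is an illustrative implementation, not the authors' code: the dense list-of-lists matrix, the helper names `matvec` and `dot`, the stopping tolerance, and the 3x3 test system are assumptions. It uses the standard update $x_{k+1} = x_k + \alpha_k p_k$, which is the form consistent with the residual recurrence in step 4.

```python
def matvec(A, v):
    """Dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-12, max_iter=50):
    """Plain CG for SPD A, starting from x_0 = 0."""
    x = [0.0] * len(b)
    r = list(b)                  # r_0 = b - A x_0 = b
    p = list(r)                  # first search direction is the residual
    rr = dot(r, r)
    iters = 0
    while iters < max_iter and rr ** 0.5 >= tol:
        q = matvec(A, p)                                  # step 1
        alpha = rr / dot(p, q)                            # step 2
        x = [xi + alpha * pi for xi, pi in zip(x, p)]     # step 3
        r = [ri - alpha * qi for ri, qi in zip(r, q)]     # step 4
        rr_new = dot(r, r)
        beta = rr_new / rr                                # step 5
        p = [ri + beta * pi for ri, pi in zip(r, p)]      # step 6
        rr = rr_new
        iters += 1
    return x, iters


# Small SPD test system (symmetric, strictly diagonally dominant).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]  # equals A applied to [1, 2, 3]

x, iters = conjugate_gradient(A, b)
```

Note that the loop tests the recurrence residual, which is only trustworthy in a fault-free run.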
Introduction: Self-stabilizing Conjugate Gradient
- CG is a Krylov subspace method: $\{p_k\}$ and $\{r_k\}$ span the Krylov subspace $K(A, r_0, m) = \operatorname{span}\{r_0, A r_0, \ldots, A^{m-1} r_0\}$
- Global orthogonality properties: $p_i^T A p_j = 0$ if $i \neq j$; $r_i^T r_j = 0$ if $i \neq j$; and $r_i^T p_j = 0$ if $i > j$
- Finite termination in exact arithmetic
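The orthogonality properties can be checked numerically on a small fault-free run. The sketch below is an assumption-laden illustration: the 3x3 system and the helper names `cg_history` and `max_violation` are made up for this purpose. It records every residual and search direction, then reports the largest magnitude among the inner products that should vanish.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_history(A, b, steps):
    """Run `steps` fault-free CG iterations from x_0 = 0, keeping all r_k, p_k."""
    r = list(b)
    p = list(r)
    rs, ps = [r], [p]
    for _ in range(steps):
        q = matvec(A, p)
        alpha = dot(r, r) / dot(p, q)
        r_new = [ri - alpha * qi for ri, qi in zip(r, q)]
        beta = dot(r_new, r_new) / dot(r, r)
        p = [ri + beta * pi for ri, pi in zip(r_new, p)]
        r = r_new
        rs.append(r)
        ps.append(p)
    return rs, ps


def max_violation(A, rs, ps):
    """Largest |inner product| among the pairs that should be orthogonal."""
    worst = 0.0
    for i in range(len(rs)):
        for j in range(len(rs)):
            if i != j:
                worst = max(worst, abs(dot(ps[i], matvec(A, ps[j]))))  # A-conjugacy
                worst = max(worst, abs(dot(rs[i], rs[j])))             # residuals
            if i > j:
                worst = max(worst, abs(dot(rs[i], ps[j])))
    return worst


A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]

rs, ps = cg_history(A, b, steps=2)   # gives r_0..r_2 and p_0..p_2
violation = max_violation(A, rs, ps)
```

In floating-point arithmetic the violation is not exactly zero, only close to rounding level for a problem this small.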
Introduction: Self-stabilizing Conjugate Gradient: Effects of faults on Conjugate Gradient
- In general, most Krylov subspace properties are lost
- Multiple potential outcomes due to faults:
  1. Error in $r_k$ ⇒ convergence to an incorrect value
  2. Error in $p_k$ ⇒ divergence, stagnation, or slow convergence
- It is difficult to detect the validity of a state
Introduction: Self-stabilizing Conjugate Gradient
We identify the following relations, which are sufficient to guarantee convergence (a corollary of the Zoutendijk condition):
- Residual condition: $r_k = b - A x_k$
- Optimal step length: $\alpha_k = \dfrac{r_k^T p_k}{p_k^T A p_k}$
- Correct search direction: $\dfrac{p_k^T r_k}{\|p_k\| \, \|r_k\|} > c_1$
- Local orthogonality relation: $p_{k+1}^T A p_k = 0$
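The slides do not spell out the correction step itself, so the following is only one plausible sketch derived from the four relations above, not necessarily the authors' formulation. It recomputes $r$ from $x$, uses $\alpha = r^T p / p^T A p$, and picks $\beta = -r_{k+1}^T A p_k / p_k^T A p_k$ so that $p_{k+1}^T A p_k = 0$. The function names, the tridiagonal test matrix, the period of 5, and the injected fault are all hypothetical.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axpy(alpha, u, v):
    """Return v + alpha * u."""
    return [vi + alpha * ui for ui, vi in zip(u, v)]

def true_residual(A, b, x):
    return axpy(-1.0, matvec(A, x), b)       # b - A x

def correction_step(A, b, x, p):
    """Rebuild the state from x and p only (assumed to run reliably)."""
    r = true_residual(A, b, x)               # residual condition
    q = matvec(A, p)
    pq = dot(p, q)
    alpha = dot(r, p) / pq                   # optimal step length along p
    x = axpy(alpha, p, x)
    r = axpy(-alpha, q, r)
    beta = -dot(r, q) / pq                   # new p is A-orthogonal to old p
    return x, r, axpy(beta, p, r)

def cg_step(A, x, r, p):
    """Ordinary CG transition (the part that may run unreliably)."""
    q = matvec(A, p)
    rr = dot(r, r)
    alpha = rr / dot(p, q)
    x = axpy(alpha, p, x)
    r = axpy(-alpha, q, r)
    return x, r, axpy(dot(r, r) / rr, p, r)

def cg_ss(A, b, period=5, fault_at=None, tol=1e-8, max_iter=300):
    x, r, p = [0.0] * len(b), list(b), list(b)
    for k in range(max_iter):
        if k == fault_at:
            # Hypothetical transient fault corrupting r and p.
            r = [-10.0 * ri for ri in r]
            p = [pi + 1.0 for pi in p]
        if k % period == 0:
            # Convergence is only tested on the reliably recomputed residual.
            res = true_residual(A, b, x)
            if dot(res, res) ** 0.5 < tol:
                return x, k
            x, r, p = correction_step(A, b, x, p)
        else:
            x, r, p = cg_step(A, x, r, p)
    return x, max_iter

# Hypothetical SPD test problem: tridiagonal with 3 on the diagonal, -1 beside it.
n = 30
A = [[3.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
x_true = [float(i + 1) for i in range(n)]
b = matvec(A, x_true)

x_clean, k_clean = cg_ss(A, b)
x_faulty, k_faulty = cg_ss(A, b, fault_at=7)
```

In a fault-free run the correction step reduces to an ordinary CG step, because $r^T p = r^T r$ and $r_{k+1}^T r_k = 0$ hold there. The sketch recomputes the residual twice per correction for clarity; a real implementation would reuse it.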
Experiments
- Assume a selective reliability mode, i.e., the correction step can be done reliably
- Inject faults into the sparse matrix-vector (SpMV) product by flipping bits in matrix entries at a specified rate:
  1. Bit flips in mantissa and sign bits: 40 bit flips in every unreliable SpMV
  2. Bit flips can occur anywhere (including the exponent): 4 bit flips in every unreliable SpMV
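The two fault models can be mimicked on IEEE 754 double-precision values with Python's `struct` module. This is a sketch of the idea only; the helper names `flip_bit` and `inject_faults`, the uniform choice of entry and bit, and the use of a plain list for the matrix values are assumptions, and the actual injection mechanism used in the experiments is not described in the slides.

```python
import random
import struct

# Bit positions in a float64: 0-51 mantissa, 52-62 exponent, 63 sign.
MANTISSA_AND_SIGN = list(range(52)) + [63]   # "bounded" fault model
ANY_BIT = list(range(64))                    # "unbounded" fault model


def flip_bit(value, bit):
    """Return `value` with one bit of its float64 representation flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def inject_faults(values, n_flips, allowed_bits, rng):
    """Copy `values` and flip `n_flips` randomly chosen bits in the copy."""
    out = list(values)
    for _ in range(n_flips):
        i = rng.randrange(len(out))
        out[i] = flip_bit(out[i], rng.choice(allowed_bits))
    return out


rng = random.Random(0)
nonzeros = [1.0] * 100          # stand-in for the stored matrix entries
bounded = inject_faults(nonzeros, 40, MANTISSA_AND_SIGN, rng)
unbounded = inject_faults(nonzeros, 4, ANY_BIT, rng)
```

A mantissa or sign flip on a normal number changes it by at most a bounded relative amount, whereas an exponent flip can produce arbitrarily large values: flipping bit 62 of `1.0` gives infinity.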
Experiments: Problems
We test self-stabilizing CG (CG-SS) on three problems with different convergence profiles and conditioning.

Name        N       NNZ      κ(A)     Convergence profile
K3D         27000   183600   646      Quadratic
DIAG        10000   10000    990100   Linear
THERMAL1    82654   574458   496250   Sub-linear

Table: Different problems used for experimentation
Experiments: Solvers
We compare the performance of CG-SS against the following solvers:
1. Reliable-CG: all computations are done reliably
2. CG-SS: self-stabilizing CG with a correction step every 10th iteration
3. CG-RES: restarted CG, with a restart every 10th iteration
4. FT-GMRES: inner-outer iteration based fault-tolerant GMRES, where the outer iteration is done reliably
Experiments: K3D: Quadratic Convergence
In the presence of faults, only linear convergence is observed.
[Figure: "Convergence History" versus number of iterations for Error Free CG, CG-SS, CG-RES, and FT-GMRES. (a) Bounded errors (mantissa and sign bit flips only). (b) Unbounded errors (including exponent).]
Experiments: THERMAL1: Sub-linear Convergence
The convergence rate of CG-SS and CG-RES does not change by much; FT-GMRES shows better convergence due to preconditioning.
[Figure: "Convergence History" versus number of iterations for Error Free CG, CG-SS, CG-RES, and FT-GMRES. (a) Bounded errors (mantissa and sign bit flips only). (b) Unbounded errors (including exponent).]
Experiments: DIAG: Linear Convergence
Linear convergence is maintained; however, a slight slow-down in convergence is observed.
[Figure: "Convergence History" versus number of iterations for Error Free CG, CG-SS, CG-RES, and FT-GMRES. (a) Bounded errors (mantissa and sign bit flips only). (b) Unbounded errors (including exponent).]
Experiments: Amount of Reliable Computation Required
Compared to Reliable-CG, CG-SS requires < 30% reliable SpMV to reach the same error tolerance.
[Figure: "Error Tolerance versus # Reliable SpMV": fraction of reliable SpMV versus error tolerance for Error Free CG, CG-SS, CG-RES, and FT-GMRES. (a) Bounded errors (mantissa and sign bit flips only). (b) Unbounded errors (including exponent).]
Analysis
[Figure: observed convergence in the presence of faults compared with ideal convergence in finite precision.]
It can be shown that if $\kappa(A)$ is the condition number and $\eta$ is the rate of bit flips per SpMV, then [the result is missing from the extracted text].