Fault-tolerant matrix factorisation: a formal model and proof


  1. Daniel Torres, Fault-tolerant matrix factorisation: a formal model and proof, 1/26
     - Fault-tolerant matrix factorisation: a formal model and proof
     - Camille Coti, Laure Petrucci, Daniel Alberto Torres González
     - Laboratoire d'Informatique de Paris Nord, CNRS UMR 7030, Université Paris 13, Sorbonne Paris Cité, 99, avenue Jean-Baptiste Clément, F-93430 Villetaneuse, FRANCE
     - camille.coti@lipn.univ-paris13.fr, laure.petrucci@lipn.univ-paris13.fr, daniel.torres@lipn.univ-paris13.fr
     - April 6, 2019

  2. Content
     1. Motivation
     2. Introduction: High Performance Computing, Fault Tolerance, Formal Models
     3. The FT-TSQR algorithm: TSQR, FT-TSQR
     4. Model
     5. Properties
     6. Conclusion and perspectives

  4. Motivation
     - Matrix operations: addition, transposition, matrix multiplication; row operations, submatrix; diagonal, triangular, identity and orthogonal matrices; determinant, eigenvalues, eigenvectors
     - Decompositions: QR, LU, Cholesky
     - TSQR is used by iterative methods: linear systems with multiple right-hand sides, block iterative eigensolvers, s-step Krylov methods

  5. Motivation: fault tolerance in HPC
     - System-level: transparent for the application; specific middleware ensures a coherent state of the application
     - Application-level: the application itself handles the failures and adapts to them; the middleware must be robust enough to provide primitives

  6. Introduction, High Performance Computing: High Performance Systems
     - Large-scale platforms have their own technical challenges
     - The total number of hardware and software components grows exponentially
     - Platforms are needed to manage and handle complex computational problems
     - Hardware or software failures may occur at any time during the execution of highly parallel applications
     - System reliability, availability and scalability are factors to deal with
     - Failures may result in long execution times and high cost

  7. Introduction, High Performance Computing: Top500.org
     - Top500: a statistical list with ranks and details of the world's 500 most powerful supercomputers
     - It shows that performance has almost doubled each year
     Rank | System | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s) | Power (kW)
     1 | Summit - IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM; DOE/SC/Oak Ridge National Laboratory, United States | 2,397,824 | 143,500.0 | 200,794.9 | 9,783
     2 | Sierra - IBM Power System S922LC, IBM POWER9 22C 3.1GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM / NVIDIA / Mellanox; DOE/NNSA/LLNL, United States | 1,572,480 | 94,640.0 | 125,712.0 | 7,438
     3 | Sunway TaihuLight - Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway, NRCPC; National Supercomputing Center in Wuxi, China | 10,649,600 | 93,014.6 | 125,435.9 | 15,371
     4 | Tianhe-2A - TH-IVB-FEP Cluster, Intel Xeon E5-2692v2 12C 2.2GHz, TH Express-2, Matrix-2000, NUDT; National Super Computer Center in Guangzhou, China | 4,981,760 | 61,444.5 | 100,678.7 | 18,482
     5 | Piz Daint - Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100, Cray Inc.; Swiss National Supercomputing Centre (CSCS), Switzerland | 387,872 | 21,230.0 | 27,154.3 | 2,384

  8. Introduction, High Performance Computing. [Figure: "#1 Cores Number"; x-axis: Year, 2000 to 2017; y-axis: # Cores, 0 to 1.2x10^7]

  9. Introduction, Fault Tolerance: failures in High Performance Systems
     - Node increase in HPC ⇒ platforms more subject to failures
     - Mean Time Between Failures (MTBF): a measure of system reliability, where reliability is defined as the probability that the system performs without deviations from agreed-upon behavior for a specific period of time
     - $\mathrm{MTBF}_T = \left( \sum_{i=0}^{n-1} \frac{1}{\mathrm{MTBF}_i} \right)^{-1}$
     - Failures can arise at any time: they stop the execution partially or totally (crash-type failures) or provide incorrect results (bit errors)
     - With an increase in the number of components, the system will experience a component failure every few hours or even minutes

  10. Introduction, Fault Tolerance: MTBF. [Figure: Mean Time Between Failures of the System (hours), 0 to 5000, against Number of components in the system, 1 to 1e+06; legend: 10 000 H, 100 000 H, 1 000 000 H, 10 000 000 H]
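As a rough illustration of the system MTBF formula from the slides, MTBF_T = (sum over i of 1/MTBF_i)^-1, the sketch below evaluates it for a system of identical components. The function names are hypothetical, and the code assumes independent components whose failure rates simply add up, which is the assumption behind the formula.

```python
def system_mtbf(component_mtbfs):
    """Return MTBF_T = (sum_i 1/MTBF_i)^-1 for a list of component MTBFs (hours).

    Assumption: components fail independently, so their failure rates add up.
    """
    if not component_mtbfs:
        raise ValueError("need at least one component")
    return 1.0 / sum(1.0 / m for m in component_mtbfs)


def identical_system_mtbf(component_mtbf_hours, n_components):
    """Special case of n identical components: the formula reduces to MTBF / n."""
    return system_mtbf([component_mtbf_hours] * n_components)


# One curve of the kind plotted on the slide: component MTBF of 1 000 000 hours.
for n in (1, 10, 1000, 100000):
    print(n, identical_system_mtbf(1_000_000, n))
```

With a component MTBF of 1 000 000 hours, 100 000 components already bring the system MTBF down to about ten hours, which matches the "every few hours" remark on the previous slide.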

  11. Introduction, Fault Tolerance: fault tolerance challenges in HPC
      - HPC algorithms should be designed to expect failures (it is very difficult to predict all possible failures) and to take suitable actions (ensure that intensive applications run smoothly with reduced overhead)
      - Fault-tolerant solutions are being incorporated to: contain failures; minimize the impact of failures; provide a fault-tolerant environment; enhance the utilization of the system at large scale; ensure the failure-free execution of critical algorithms
      - It is hard to describe and verify the system's properties: how to simplify it?

  12. Introduction, Formal Models: formal models
      - Coloured Petri Nets (CPN): give a better understanding of the system; ensure mathematically that the system is correct; support modelling, validating properties and synchronizing communications of parallel and distributed algorithms; allow for better readability and understandability
      - FT-TSQR formal model: a formal model and associated verifications; proves that the algorithm tolerates failures; guarantees that the final results are correct; the data flow is correct; each process calculates and shares its results; at the end, all the processes have the same result

  13. The FT-TSQR algorithm, TSQR: Los Tres Amigos
      - QR factorization: A = QR, with R upper triangular and Q orthogonal
      - LU decomposition: A = LU, with L lower triangular and U upper triangular
      - Cholesky factorization: A = LL^T, with A symmetric positive definite and L lower triangular
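A small worked example of the three factorisations may help. The 2x2 matrices below are chosen here only because the factors come out with simple entries; they do not appear in the slides.

```latex
% QR: R upper triangular, Q orthogonal
\[
\begin{pmatrix} 3 & 1 \\ 4 & 2 \end{pmatrix}
= \underbrace{\begin{pmatrix} 3/5 & -4/5 \\ 4/5 & 3/5 \end{pmatrix}}_{Q}
  \underbrace{\begin{pmatrix} 5 & 11/5 \\ 0 & 2/5 \end{pmatrix}}_{R}
\]
% LU: L lower triangular, U upper triangular
\[
\begin{pmatrix} 4 & 2 \\ 2 & 2 \end{pmatrix}
= \underbrace{\begin{pmatrix} 1 & 0 \\ 1/2 & 1 \end{pmatrix}}_{L}
  \underbrace{\begin{pmatrix} 4 & 2 \\ 0 & 1 \end{pmatrix}}_{U}
\]
% Cholesky: same symmetric positive definite matrix, A = L L^T
\[
\begin{pmatrix} 4 & 2 \\ 2 & 2 \end{pmatrix}
= \underbrace{\begin{pmatrix} 2 & 0 \\ 1 & 1 \end{pmatrix}}_{L}
  \underbrace{\begin{pmatrix} 2 & 1 \\ 0 & 1 \end{pmatrix}}_{L^{T}}
\]
```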

  14. The FT-TSQR algorithm, TSQR: Tall and Skinny QR (TSQR) factorisation
      - It calculates the QR factorisation of a tall and skinny matrix A, i.e. a matrix with m rows and n columns, m ≫ n
      - Linear algebra applications depend on the algorithm: Ax = b
      - It is numerically stable: eigenvalue computation is sensitive to the accuracy of the orthogonalization
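The reason a tall and skinny matrix can be factorised block by block can be written out for two row blocks. The two-block split is an illustrative assumption; the names A_0, A_1, R_0, R_1, Q_01 and R_01 follow the FT-TSQR diagram of the deck.

```latex
\[
A = \begin{pmatrix} A_0 \\ A_1 \end{pmatrix}
  = \begin{pmatrix} Q_0 R_0 \\ Q_1 R_1 \end{pmatrix}
  = \begin{pmatrix} Q_0 & 0 \\ 0 & Q_1 \end{pmatrix}
    \begin{pmatrix} R_0 \\ R_1 \end{pmatrix},
\qquad
\begin{pmatrix} R_0 \\ R_1 \end{pmatrix} = Q_{01} R_{01}
\;\Longrightarrow\;
A = \underbrace{\begin{pmatrix} Q_0 & 0 \\ 0 & Q_1 \end{pmatrix} Q_{01}}_{Q} \, R_{01}.
\]
```

Since a product of matrices with orthonormal columns has orthonormal columns, R_01 is an R factor of the whole matrix A, and only the small n-by-n R factors need to be communicated.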

  15. The FT-TSQR algorithm, FT-TSQR: Fault-Tolerant TSQR. [Diagram built up over seven slides; labels: t, processes; processes P0, P1, P2, P3]
      - Each process Pi holds a block Ai of the matrix: A0, A1, A2, A3
      - QR: each process factorises its block, Ai = Qi Ri
      - Exchange: P0 and P1 both hold R0 and R1; P2 and P3 both hold R2 and R3
      - QR: P0 and P1 both compute Q01 R01; P2 and P3 both compute Q23 R23
      - Exchange: all four processes hold R01 and R23
      - QR: all four processes compute Q0123 R0123
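The exchange pattern of the diagram can be simulated sequentially in a few lines. This is only a sketch under stated assumptions, not the authors' implementation: the helper names `qr_r` and `ft_tsqr` are hypothetical, the pairing `rank ^ step` is an assumed butterfly scheme (the slides only show which R factors each process ends up holding), the QR routine is a plain modified Gram-Schmidt that assumes full column rank, and no failure is injected.

```python
import math


def qr_r(a):
    """Return the R factor of A = QR (modified Gram-Schmidt, full column rank assumed).

    R is upper triangular with a positive diagonal, which makes it unique.
    """
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]  # column vectors of A
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j]
        for k in range(j):
            # cols[k] already holds the k-th orthonormal column of Q
            r[k][j] = sum(q * x for q, x in zip(cols[k], v))
            v = [x - r[k][j] * q for x, q in zip(v, cols[k])]
        r[j][j] = math.sqrt(sum(x * x for x in v))
        cols[j] = [x / r[j][j] for x in v]
    return r


def ft_tsqr(blocks):
    """Simulate the redundant exchange for len(blocks) = 2**k processes.

    Returns one R factor per process; all of them should be identical.
    """
    p = len(blocks)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of processes")
    r = [qr_r(block) for block in blocks]  # local QR: A_i = Q_i R_i
    step = 1
    while step < p:
        nxt = []
        for rank in range(p):
            buddy = rank ^ step  # assumed butterfly pairing
            lo, hi = min(rank, buddy), max(rank, buddy)
            # Both partners stack the two R factors in the same order and factorise.
            nxt.append(qr_r(r[lo] + r[hi]))
        r = nxt
        step *= 2
    return r


# Four processes, each holding a 2x2 block of an 8x2 matrix (made-up values).
a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 9.0],
     [2.0, 1.0], [0.0, 3.0], [4.0, 4.0], [1.0, 5.0]]
results = ft_tsqr([a[0:2], a[2:4], a[4:6], a[6:8]])
print(all(r == results[0] for r in results))
```

After the first exchange each intermediate R factor exists on two processes, and after the second on all four, so every process finishes with the same final R.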
