
Coded QR Decomposition: Quang Minh Nguyen, MIT; Haewon Jeong, Harvard University; Pulkit Grover, Carnegie Mellon University



  1. Coded QR Decomposition
     Quang Minh Nguyen, MIT; Haewon Jeong, Harvard University; Pulkit Grover, Carnegie Mellon University

  2. Motivation

  3. Motivation
     Coded Computing: Coding Theory + Distributed Computing
     Straggling issue in cloud computing:
     - Coded Matrix Multiplication [Lee et al. ’15, ’17, Yu et al. ’17, Jeong et al. ’17, ’18, Baharav ’18, Sinong et al. ’18, Shahrzad et al. ’19]
     - Coded MapReduce [Li et al. ’15, ’17, ’18]
     - Coded Gradient Descent [Tandon et al. ’16, Raviv et al. ’17, Halbawi et al. ’18, Ye ’18]
     Other issues ??

  4. Motivation
     Other issues ??
     - Processor failure issue in High Performance Computing (HPC)

  5. Motivation
     Other issues ??
     - Processor failure issue in High Performance Computing (HPC)
     Larger scale → unreliability!
     Fugaku supercomputer (2021): 150,000 nodes.
     Mean time between failures (MTBF): a system-level MTBF of 24-48 hours corresponds to a node-level MTBF of 411-822 years!
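The node-level figure can be checked with a short calculation. This is a minimal sketch that assumes independent, identically distributed node failures, so that the system-level MTBF is roughly the node-level MTBF divided by the number of nodes; the function name `node_mtbf_years` is made up for this illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def node_mtbf_years(system_mtbf_hours: float, num_nodes: int) -> float:
    """Node-level MTBF (in years) implied by a system-level MTBF.

    Assumes independent node failures at a common rate, so the system
    fails num_nodes times as often as a single node does.
    """
    return system_mtbf_hours * num_nodes / HOURS_PER_YEAR


low = node_mtbf_years(24, 150_000)
high = node_mtbf_years(48, 150_000)
print(f"{low:.0f} to {high:.0f} years")  # prints "411 to 822 years"
```

Read in the other direction: even if every node is reliable for centuries, a 150,000-node machine should expect a failure every day or two.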

  6. Motivation
     Other issues ??
     - Processor failure issue in High Performance Computing (HPC)
     Larger scale → unreliability!
     Fugaku supercomputer (2021): 150,000 nodes.
     Mean time between failures (MTBF): a system-level MTBF of 24-48 hours corresponds to a node-level MTBF of 411-822 years!
     HPC's solution: algorithm-based fault tolerance (ABFT) = adding encoded redundancy tailored to a specific algorithm.
     Same idea as Coded Computing!

  7. Motivation: bridge the gap between ABFT for HPC and Coded Computing
     • QR decomposition: an important matrix factorization in HPC, where ABFT faces challenges
     • A more practical HPC setting that was not considered in the coded computing literature:
       - Block-cyclic distribution
       - In-node checksum storage (storing redundancies in systematic nodes)
     → Coded QR Decomposition

  8. What is QR Decomposition?
     $A = QR$, with orthogonal $Q$ (i.e. $Q^T Q = I$) and upper-triangular $R$.
     • QR decomposition is widely used in many HPC applications: solving systems of linear equations, SVM, linear least squares problems, etc.
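As a quick illustration of the two defining properties, the sketch below factorizes a small random matrix with NumPy's `numpy.linalg.qr` and checks them numerically. The matrix size and random seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

# Default (reduced) mode: for a square A, Q and R are both 4 x 4.
Q, R = np.linalg.qr(A)

assert np.allclose(Q.T @ Q, np.eye(4))  # Q is orthogonal
assert np.allclose(R, np.triu(R))       # R is upper-triangular
assert np.allclose(Q @ R, A)            # the factors reproduce A
```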

  9. ABFT for QR Decomposition
     Key idea [O. Maslennikow et al. ’98, P. Du et al. ’12, P. Wu et al. ’14]: factorize $A$ together with its checksums,
     $[\,A \;\; \text{checksums}\,] = Q \times [\,R \;\; \text{checksums}\,] = Q \times R'$.
     $R'$ is upper-triangular → $R$ is upper-triangular.
     So we can retrieve $A = Q \times R$ as the QR decomposition of $A$.
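The row-checksum idea can be tried numerically. This is a minimal sketch, assuming the checksums are formed as $A G_h$ for a generator matrix `G_h` (a random one here, chosen only for illustration): factorizing $[A \;\; A G_h]$ yields $R' = [R \;\; R G_h]$, so a lost column of $R$ can be rebuilt from the surviving columns and one checksum column.

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 5, 2
A = rng.standard_normal((n, n))
G_h = rng.standard_normal((n, c))  # hypothetical checksum generator

A_enc = np.hstack([A, A @ G_h])    # n x (n + c): data plus checksum columns
Q, R_prime = np.linalg.qr(A_enc)   # Q is n x n, R_prime is n x (n + c)
R = R_prime[:, :n]
R_checksums = R_prime[:, n:]

# The checksum relation carries over from A to R.
assert np.allclose(R_checksums, R @ G_h)
assert np.allclose(Q @ R, A)

# Recover a "lost" column j of R from the others and the first checksum column.
j = 2
others = [k for k in range(n) if k != j]
recovered = (R_checksums[:, 0] - R[:, others] @ G_h[others, 0]) / G_h[j, 0]
assert np.allclose(recovered, R[:, j])
```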

  10. Challenges in Coding for QR Decomposition
     • Can we do the same trick for Q protection? NO.
       $\begin{bmatrix} A \\ \text{checksums} \end{bmatrix} = \begin{bmatrix} Q \\ \text{checksums} \end{bmatrix} \times R = Q' \times R$, where $Q$ is not orthogonal: $Q'^T \times Q' = I$ does not imply $Q^T \times Q = I$.
     • Proven in [Theorem 5.1, P. Du et al. ’12].
     → Challenge 1: Q protection via coding? Can we efficiently restore the orthogonality of Q?
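The loss of orthogonality is easy to observe numerically. The sketch below assumes the column checksums are formed as $G_v A$ with a random generator `G_v` (an illustrative choice) and uses `numpy.linalg.qr` on the stacked matrix. The top block of $Q'$ still factorizes $A$ and still satisfies the checksum relation, but it is no longer orthogonal.

```python
import numpy as np

rng = np.random.default_rng(2)
n, c = 5, 2
A = rng.standard_normal((n, n))
G_v = rng.standard_normal((c, n))   # hypothetical checksum generator

A_enc = np.vstack([A, G_v @ A])     # (n + c) x n: data plus checksum rows
Q_prime, R = np.linalg.qr(A_enc)    # Q_prime is (n + c) x n, R is n x n

Q_top = Q_prime[:n]                 # the block that should play the role of Q
Q_checksums = Q_prime[n:]

assert np.allclose(Q_checksums, G_v @ Q_top)           # checksums are preserved
assert np.allclose(Q_top @ R, A)                       # A is still factorized
assert np.allclose(Q_prime.T @ Q_prime, np.eye(n))     # Q' has orthonormal columns
assert not np.allclose(Q_top.T @ Q_top, np.eye(n))     # but the top block does not
```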

  11. Challenges in Coding for QR Decomposition
     In-node checksum storage:
     • was recently proposed for ABFT [P. Du et al. ’12].
     • stores coded data (checksums) in the original processors instead of adding extra processors for fault tolerance.

  12. Challenges in Coding for QR Decomposition
     Out-of-node checksum storage (conventional setting): $A_0$ on Node 0, $A_1$ on Node 1, and the checksum $A_0 + A_1$ on Node 2.
     - Optimal coding strategy: MDS
     In-node checksum storage: $A_0$, $A_1$ and the checksum $A_0 + A_1$ are all stored on Node 0 and Node 1.
     - Fundamental limit??
     → Can we still have some optimality guarantee like the MDS condition?
     → Challenge 2: minimal number of checksums required under in-node checksum storage?
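A toy model shows why the in-node setting needs more care. This sketch is only an illustration: the helper `recover` and the decision to place the single in-node checksum on Node 1 are assumptions made here, not details from the slides.

```python
from typing import Optional, Tuple

import numpy as np

A0 = np.array([[1.0, 2.0], [3.0, 4.0]])
A1 = np.array([[5.0, 6.0], [7.0, 8.0]])

# Node id -> blocks stored on that node.
out_of_node = {0: {"A0": A0}, 1: {"A1": A1}, 2: {"sum": A0 + A1}}
# Assumption: the single in-node checksum lives on Node 1.
in_node = {0: {"A0": A0}, 1: {"A1": A1, "sum": A0 + A1}}


def recover(nodes: dict, failed: int) -> Optional[Tuple[np.ndarray, np.ndarray]]:
    """Rebuild (A0, A1) from the surviving nodes, or return None if impossible."""
    alive = {}
    for node, blocks in nodes.items():
        if node != failed:
            alive.update(blocks)
    if "A0" in alive and "A1" in alive:
        return alive["A0"], alive["A1"]
    if "sum" in alive and "A1" in alive:
        return alive["sum"] - alive["A1"], alive["A1"]
    if "sum" in alive and "A0" in alive:
        return alive["A0"], alive["sum"] - alive["A0"]
    return None


# Out-of-node: any single failure is recoverable with one checksum.
assert all(recover(out_of_node, f) is not None for f in out_of_node)
# In-node: losing the node that also holds the checksum loses A1 for good.
assert recover(in_node, 0) is not None
assert recover(in_node, 1) is None
```

With this placement a single checksum is not enough, which is why the minimal number of checksums under in-node storage is a real question.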

  13. Summary of Challenges
     Challenge 1: Q protection via coding?
     Challenge 2: minimal number of checksums required under in-node checksum storage?
     → Our contribution: address these 2 challenges

  14. System Model
     • For fault tolerance, we encode the $n \times n$ matrix $A$ with both vertical checksums $G_v A$ and horizontal checksums $A G_h$, where $G_v$ and $G_h$ are checksum-generator matrices.
     • Out-of-node checksum storage: the checksums are distributed over a new set of checksum processors.

  15. System Model
     Coded Computing: master-worker setting
     [Figure: a master node splits the input into $A_0, A_1, A_2, A_3$ and distributes these blocks, plus redundancy, to Workers 1-5; a master node collects the output.]

  16. System Model
     Coded Computing: master-worker setting
     [Figure: a master node splits the input into $A_0, A_1, A_2, A_3$ and distributes these blocks, plus redundancy, to Workers 1-5; a master node collects the output.]
     HPC setting: 2D block-cyclic distribution
     • The input matrix $A$ is distributed among processors.
     • The layout below is maintained throughout the computation.
     [Figure: grid of systematic processors with separate checksum processors.]

  17. Failure Model and Real-time Recovery in HPC
     Single-node fail-stop failures:
     • A failure corresponds to a systematic processor that completely stops responding and loses its part of the global data.
     • The identity of the failed processor is provided by some external source.
     Real-time recovery:
     • A failure can occur at any point during the execution of QR decomposition, immediately triggering the recovery process.
     • Computation continues once the system has recovered from its latest failure.

  18. QR Decomposition: Modified Gram-Schmidt (MGS) algorithm
     We consider MGS, one of the 3 most widely used algorithms for QR decomposition.
     [Figure: the Q computation and R computation steps of MGS.]
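For reference, here is a minimal sketch of the textbook right-looking MGS iteration in NumPy. It assumes a real input with full column rank and is not the distributed implementation discussed in the talk; the function name `mgs_qr` is chosen for this example.

```python
import numpy as np


def mgs_qr(A):
    """Modified Gram-Schmidt QR of a full-column-rank matrix A (m x n, m >= n)."""
    Q = np.array(A, dtype=float)   # working copy; columns become orthonormal
    n = Q.shape[1]
    R = np.zeros((n, n))
    for t in range(n):
        # Q computation: normalize the current column.
        R[t, t] = np.linalg.norm(Q[:, t])
        Q[:, t] /= R[t, t]
        # R computation: project the trailing columns onto q_t and remove it.
        R[t, t + 1:] = Q[:, t] @ Q[:, t + 1:]
        Q[:, t + 1:] -= np.outer(Q[:, t], R[t, t + 1:])
    return Q, R


rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)) + 5 * np.eye(5)  # well-conditioned test matrix
Q, R = mgs_qr(A)
assert np.allclose(Q.T @ Q, np.eye(5))
assert np.allclose(R, np.triu(R))
assert np.allclose(Q @ R, A)
```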

  19. Main Results
     Checksum preservation for MGS: checksums are preserved to facilitate fault-tolerant computation.
     Challenge 1: Q protection via coding?
     → Post-orthogonalization: post-processing to restore the degraded orthogonality.
     Challenge 2: minimal number of checksums required under in-node checksum storage?
     → Optimality for the in-node checksum storage setting: minimal number of checksums for single-node failure tolerance.

  20. Checksum preservation for MGS
     • To facilitate real-time recovery, we want the checksums to be preserved at any iteration of MGS (or GS).
     • We encode $A \to \tilde{A}$, and QR-factorize $\tilde{A}$.
     • At each iteration $t = 1, \dots, T$, the algorithm maintains the updates $Q^{(t)}$ and $R^{(t)}$, so that at the end $\tilde{A} = Q^{(T)} R^{(T)}$ is the QR decomposition of $\tilde{A}$.

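  21. Checksum preservation for MGS
     We prove that, at any iteration $t$ of MGS on $\tilde{A}$ (built from $A$, $A G_h$ and $G_v A$):
     $Q^{(t)} = \begin{bmatrix} Q_1^{(t)} \\ G_v Q_1^{(t)} \end{bmatrix}$, $\quad R^{(t)} = \begin{bmatrix} R_1^{(t)} & R_1^{(t)} G_h \end{bmatrix}$.
     Checksums preserved!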

  22. Checksum preservation for MGS
     At the end, i.e. $t = T$, we have:
     $Q^{(T)} = \begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}$, $\quad R^{(T)} = \begin{bmatrix} R_1 & R_1 G_h \end{bmatrix}$.
     → Retrieve $A = Q_1 R_1$, where $Q_1$ is non-orthogonal (first challenge) and $R_1$ is upper-triangular.
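The preservation property can be observed by checking the checksum relations after every MGS iteration. This is a numerical sketch, not the proof: the generators `G_v` and `G_h` are random, the test matrix is made well-conditioned on purpose, and the corner block $G_v A G_h$ of the encoded matrix is an assumption made so that the checksum columns carry vertical checksums too.

```python
import numpy as np

rng = np.random.default_rng(3)
n, c = 6, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
G_v = rng.standard_normal((c, n))                # hypothetical vertical generator
G_h = rng.standard_normal((n, c))                # hypothetical horizontal generator

# Encoded matrix; the corner block G_v A G_h is an assumption of this sketch.
top = np.hstack([A, A @ G_h])
W = np.vstack([top, G_v @ top])   # working matrix, (n + c) x (n + c)
R = np.zeros((n, n + c))

for t in range(n):                # MGS over the n data columns
    R[t, t] = np.linalg.norm(W[:, t])
    W[:, t] /= R[t, t]
    R[t, t + 1:] = W[:, t] @ W[:, t + 1:]
    W[:, t + 1:] -= np.outer(W[:, t], R[t, t + 1:])
    # Vertical checksums: bottom rows equal G_v times the top rows.
    assert np.allclose(W[n:], G_v @ W[:n])
    # Horizontal checksums on the rows of R computed so far.
    assert np.allclose(R[:t + 1, n:], R[:t + 1, :n] @ G_h)

Q1, R1 = W[:n, :n], R[:, :n]
assert np.allclose(Q1 @ R1, A)                  # A = Q_1 R_1 is retrieved
assert not np.allclose(Q1.T @ Q1, np.eye(n))    # but Q_1 is not orthogonal
```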

  23. Challenge 1: Degraded Orthogonality of Conventional Coding
     Challenge 1: in $Q^{(T)} = \begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}$, the block $Q_1$ is not orthogonal.
     In this work, we raise the question “How ‘non-orthogonal’ is $Q_1$?”

  24. Challenge 1: Degraded Orthogonality of Conventional Coding
     Challenge 1: in $Q^{(T)} = \begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}$, the block $Q_1$ is not orthogonal.
     In this work, we raise the question “How ‘non-orthogonal’ is $Q_1$?”
     Main idea: cheap post-processing, $Q_1 \to$ orthogonal matrix!

  25. Challenge 1: Degraded Orthogonality of Conventional Coding
     Challenge 1: in $Q^{(T)} = \begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}$, the block $Q_1$ is not orthogonal.
     In this work, we raise the question “How ‘non-orthogonal’ is $Q_1$?”
     Main idea: cheap post-processing, $Q_1 \to G_0 Q_1$, an orthogonal matrix!
     → Post-orthogonalization

  26. Post-orthogonalization
     Question: can we always construct $G_0$ such that $G_0 Q_1$ is orthogonal?
     It depends on $G_v$, the $c \times n$ checksum-generator matrix, which is under our control!
     ($Q_1$ alone is not orthogonal; the stacked factor $\begin{bmatrix} Q_1 \\ G_v Q_1 \end{bmatrix}$ is.)

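  27. Construction of $G_0$
     $G_v = [\,G_1 \;\; V\,]$, where $G_1$ is $c \times c$ and $V$ is $c \times (n-c)$.
     $G_0$ is sparse, as it is an $n \times n$ block matrix assembled from the blocks $I_c + G_1$, $V$, $-I_{n-c}$ and $V^T$.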

  28. Post-orthogonalization Condition for the Checksum-generator Matrix
     Main result: we prove that if the checksum-generator matrix $G_v = [\,G_1 \;\; V\,]$ satisfies the post-orthogonalization condition, then:
     • $G_0 Q_1$ is orthogonal
     • $G_0$ is invertible
     → $A' = G_0 A = (G_0 Q_1) R$ is now the QR decomposition of $A'$! But would $A'$ be useful?
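The following sketch illustrates post-orthogonalization numerically. Two things in it are assumptions, because neither is legible in the slide text: the condition $G_1 + G_1^T + V V^T = 0$, and the block arrangement of `G_0`. Both were worked out for this example so that $G_0^T G_0 = I + G_v^T G_v$, which is exactly what makes $G_0 Q_1$ orthogonal; the authors' own condition and layout may differ.

```python
import numpy as np

rng = np.random.default_rng(4)
n, c = 6, 2
V = rng.standard_normal((c, n - c))
G_1 = -0.5 * V @ V.T           # assumed condition: G_1 + G_1^T + V V^T = 0
G_v = np.hstack([G_1, V])      # c x n checksum generator [G_1  V]

# Assumed block layout for G_0 (n x n); most of it is the identity block.
G_0 = np.block([[np.eye(c) + G_1, V],
                [V.T, -np.eye(n - c)]])
assert np.allclose(G_0.T @ G_0, np.eye(n) + G_v.T @ G_v)

# QR of the vertically encoded matrix gives a non-orthogonal top block Q_1.
A = rng.standard_normal((n, n)) + n * np.eye(n)
Q_prime, R = np.linalg.qr(np.vstack([A, G_v @ A]))
Q1 = Q_prime[:n]
assert not np.allclose(Q1.T @ Q1, np.eye(n))

# Post-orthogonalization: G_0 Q_1 is orthogonal and factorizes A' = G_0 A.
Q_fixed = G_0 @ Q1
assert np.allclose(Q_fixed.T @ Q_fixed, np.eye(n))
assert np.allclose(Q_fixed @ R, G_0 @ A)
```

Since $G_0^T G_0 = I + G_v^T G_v$ is positive definite, this `G_0` is also invertible.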

  29. Post-orthogonalization for Linear Solvers
     • We consider QR decomposition in solving a non-singular square system of linear equations: $Ax = b \Leftrightarrow A'x = (G_0 A)x = G_0 b$.
     • The QR factorization of $A'$ can now be used to find $x$: $(G_0 Q_1) R x = G_0 b \Leftrightarrow R x = (G_0 Q_1)^T (G_0 b)$.
     • Finally, $x$ can be found by a triangular solve.
     Overhead of post-orthogonalization: the matrix multiplications $(G_0 Q_1)$ and $(G_0 b)$.
     → As $G_0$ is sparse, the total overhead for fault tolerance is negligible.
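The solver steps above can be sketched end to end. As before, the condition $G_1 + G_1^T + V V^T = 0$ and the block layout of `G_0` are assumptions made for this illustration, and `back_substitute` is a helper written here rather than a library routine.

```python
import numpy as np


def back_substitute(R, y):
    """Solve R x = y for upper-triangular R with a nonzero diagonal."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x


rng = np.random.default_rng(5)
n, c = 6, 2
V = rng.standard_normal((c, n - c))
G_1 = -0.5 * V @ V.T                       # assumed post-orthogonalization condition
G_v = np.hstack([G_1, V])
G_0 = np.block([[np.eye(c) + G_1, V],      # assumed block layout of G_0
                [V.T, -np.eye(n - c)]])

A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)

# Fault-tolerant factorization: Q_1 is the (non-orthogonal) top block.
Q_prime, R = np.linalg.qr(np.vstack([A, G_v @ A]))
Q1 = Q_prime[:n]

# Solve A x = b via A' x = G_0 b, with A' = (G_0 Q_1) R.
rhs = (G_0 @ Q1).T @ (G_0 @ b)
x = back_substitute(R, rhs)
assert np.allclose(A @ x, b)
```

The only extra work compared with an ordinary QR solve is the two products with `G_0`.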

  30. Checksum-Generator Matrices for Single-Node Failures
     Note:
     • Single-node failure is the most common scenario in HPC.
     • Anything related to multiple-node failure scenarios would be interesting future work!

  31. Checksum-Generator Matrices for Single-Node Failures
     Recap:
     R-factor protection:
     • Designing $G_h$ is straightforward, as there is no restriction.
     • We can use an MDS code for optimality.
     Q-factor protection:
     • $G_v = [\,G_1 \;\; V\,]$ must satisfy the post-orthogonalization condition.
     → Construction of $G_v$ to tolerate single-node failures.

  32. In-node Checksum Storage

  33. In-node Checksum Storage
     $A_0$, $A_1$ and the checksum $A_0 + A_1$ are all stored on Node 0 and Node 1.
     • This new setting could be more appealing in practice, as it does not require additional processors.
     → Can we still have some optimality guarantee like the MDS condition?
     → Challenge 2: minimal number of checksums required under this setting?
