  1. Numerically Stable Binary Gradient Coding Neophytos Charalambides Hessam Mahdavifar Alfred Hero Department of Electrical Engineering and Computer Science, University of Michigan June, 2020 1 / 21

  2. Outline for section 1 Introduction and Motivation Gradient Coding Problem Setup Binary Scheme Allocation to Heterogeneous Workers 2 / 21

  3. Issues and Motivation — Introduction and Motivation
    Machine Learning Today: the Curse of Dimensionality
    ◮ Large datasets — many samples
    ◮ Complex datasets — high dimension
    ◮ Problems become intractable
    Use distributed methods:
    ◮ Distribute smaller computation assignments
    ◮ Multiple servers complete separate tasks
    Drawbacks of Distributed Synchronous Computations:
    ◮ Requires all servers to respond — communication overhead
    ◮ What if stragglers are present?
    ◮ Stragglers — servers that are delayed or non-responsive
    3 / 21

  4. Gradient Coding 1 — Introduction and Motivation
    1. Speed up distributed computation — gradient methods
    2. Mitigate stragglers
    1 R. Tandon et al. "Gradient Coding: Avoiding Stragglers in Synchronous Gradient Descent". In: stat 1050 (2017), p. 8. 4 / 21

  5. Benefits of our Binary Scheme — Introduction and Motivation
    Few schemes deal with exact recovery. Common issues with current exact-recovery schemes:
    1. construct and search through a decoding matrix A ∈ R^((n choose s) × n)
    2. storage issues, and further delay
    3. work over R and C — further numerical instability
    4. have the strict assumption that (s + 1) | n
    Our scheme:
    1. faster online decoding
    2. only deals with {0, 1} encodings — viewed as "task assignments"
    3. ... this makes encoding and decoding numerically stable
    4. works for any pair s, n
    5. ... we extend our construction to heterogeneous workers also
    5 / 21

  6. Outline for section 2 Introduction and Motivation Gradient Coding Problem Setup Binary Scheme Allocation to Heterogeneous Workers 6 / 21

  7. Distributed Gradient Descent — Gradient Coding
    ◮ Dataset D = {(x_i, y_i)}_{i=1}^N ⊂ R^p × R, or X ∈ R^{N×p}, y ∈ R^N
    ◮ Partition D = ∪_{j=1}^k D_j, s.t. D_i ∩ D_j = ∅ and |D_j| = N/k
    ◮ Partial gradients g_j — gradient on D_j
    ◮ Minimize the loss L(D; θ) = Σ_{j=1}^k ℓ(D_j; θ)
    ◮ Gradient descent updates: θ^(t+1) = θ^(t) − α_t g^(t)
    ◮ g^(t) = ∇_θ L(D; θ^(t)) = Σ_{j=1}^k ∇_θ ℓ(D_j; θ^(t)) = Σ_{j=1}^k g_j^(t)
    ◮ the additive structure allows g^(t) to be computed in parallel!
    7 / 21
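The additive structure on this slide can be checked in a few lines. A minimal numpy sketch (the least-squares loss, sizes, and all names are illustrative assumptions, not the talk's code): summing the k partial gradients reproduces the full gradient exactly.

```python
import numpy as np

def partial_gradient(X_j, y_j, theta):
    # Gradient of the squared loss ||X_j @ theta - y_j||^2 on partition D_j
    return 2.0 * X_j.T @ (X_j @ theta - y_j)

rng = np.random.default_rng(0)
N, p, k = 12, 4, 3
X, y = rng.normal(size=(N, p)), rng.normal(size=N)
theta = rng.normal(size=p)

# Partition the dataset into k equal pieces D_1, ..., D_k
Xs, ys = np.array_split(X, k), np.array_split(y, k)

# Each g_j could be computed by a different worker; the master only adds
g_sum = sum(partial_gradient(Xj, yj, theta) for Xj, yj in zip(Xs, ys))
g_full = 2.0 * X.T @ (X @ theta - y)
assert np.allclose(g_sum, g_full)  # additive structure
```

Each worker only ever sees its own (X_j, y_j), which is what makes the straggler problem on the next slide relevant.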

  8. Synchronous Distributed Computation Gradient Coding ◮ Execute gradient descent distributively ◮ Need all workers to respond Figure: Need all responses — g = g 1 + g 2 + g 3 8 / 21

  9. Table of Contents Introduction and Motivation Gradient Coding Problem Setup Binary Scheme Allocation to Heterogeneous Workers 9 / 21

  10. General Setup Problem Setup 10 / 21

  11. Encoding matrix — Problem Setup
    ◮ Rows: workers {W_i}_{i=1}^n; b_i = encoding vector for W_i
    ◮ Columns: partitions {D_j}_{j=1}^k
    1. nonzero entries: assigned partitions
    2. redundancy in assigned D_j 's
    ◮ Stragglers ≡ erasing rows of B
    11 / 21
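A toy illustration of "stragglers ≡ erasing rows of B" (the 4×4 matrix and worker indices here are made up for illustration): redundancy in the assignments must let the surviving rows still superpose to the all-ones vector.

```python
import numpy as np

# Hypothetical tiny instance: n = 4 workers, k = 4 partitions, s = 1 straggler.
# Row b_i lists the partitions assigned to worker W_i.
B = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])

straggler = 1                              # worker W_2 is slow
surviving = np.delete(B, straggler, axis=0)  # a straggler erases its row

# Rows 0 and 2 of the surviving matrix still cover every partition exactly once
a = np.array([1, 0, 1])
assert (a @ surviving == np.ones(4)).all()
```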

  12. Table of Contents Introduction and Motivation Gradient Coding Problem Setup Binary Scheme Allocation to Heterogeneous Workers 12 / 21

  13. Example of our Binary Scheme — Binary Scheme
    n = k = 11, s = 3 ⇒ r ≡ n ≡ 3 mod (s + 1)
    Workers of the first r congruence classes form B_1, the remaining (s + 1 − r) classes form B_2:
    ◮ B_1 ∈ {0, 1}^{9×11}: six rows of support 4 followed by three rows of support 3, with consecutive supports — close to block diagonal
    ◮ B_2 ∈ {0, 1}^{2×11}: two rows of supports 6 and 5, together covering all 11 partitions
    13 / 21

  14. Example — Encoding and Decoding — Binary Scheme
    Decoding: only take the received workers of the same color (congruence class).
    Example: the workers indexed by I = {2, 6, 10} form a complete residue system; summing their encoding rows gives a_I^T B = 1_{1×11}, where a_I ∈ {0, 1}^{11} is the indicator vector of I.
    [The slide shows B ∈ {0, 1}^{11×11} with the three selected rows highlighted.]
    14 / 21

  15. Main Idea of Our Binary Scheme — Binary Scheme
    ◮ Have B as sparse as possible ⇒ nnzr(B) = k · (s + 1)
    ◮ Work with congruence classes mod (s + 1)
    ◮ the superposition of the rows of each class results in 1_{1×k}
    ◮ Allocate tasks s.t. ‖b_i‖_0 ≃ ‖b_j‖_0 for all i, j ∈ {1, ..., n}, while satisfying the above two constraints
    ◮ Formally, construct B that is a solution to
        min_{B ∈ N_0^{n×k}} Σ_{i=1}^n | ‖b_i‖_0 − (s + 1) · k/n |   s.t.   nnzr(B) = k · (s + 1)
    ◮ Intuition: B is close to being block diagonal
    15 / 21
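The objective above asks every row support ‖b_i‖_0 to sit as close as possible to the average (s + 1)·k/n. A short sketch of that balanced load profile (function name and example sizes are illustrative; the actual B_1/B_2 construction splits the loads per congruence class, so the profile below is only the target of the objective):

```python
def balanced_loads(n, k, s):
    # Distribute the k*(s+1) total nonzeros of B over n rows so that
    # every row support ||b_i||_0 is within 1 of the average (s+1)*k/n.
    total = k * (s + 1)
    base, extra = divmod(total, n)
    return [base + 1] * extra + [base] * (n - extra)

loads = balanced_loads(n=11, k=11, s=3)
assert sum(loads) == 44                  # nnzr(B) = k*(s+1)
assert max(loads) - min(loads) <= 1      # objective value is minimal
```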

  16. Construction and Decoding — Binary Scheme
    ◮ Congruence classes C_1 = {[i]}_{i=0}^{r−1} and C_2 = {[i]}_{i=r}^{s}:
    1. r ≡ n mod (s + 1)
    2. classes within C_1 (resp. C_2) are assigned tasks identically
    3. within each of C_1, C_2, the class cardinalities do not differ by more than one
    4. construct B_1 and B_2 accordingly
    ◮ B = aggregation of B_1 and B_2
    ◮ Decoding: by the pigeonhole principle, for any f = n − s responsive workers, at least one complete residue system is present
    16 / 21
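The pigeonhole claim can be verified exhaustively on a small homogeneous instance. A sketch assuming (s + 1) | n and k = n, with illustrative sizes: for every possible straggler pattern, some residue class mod (s + 1) survives completely, and summing its rows recovers 1_{1×k}.

```python
import numpy as np
from itertools import combinations

# Homogeneous toy case: B is block diagonal, blocks of (s+1) identical rows.
n, s = 12, 3
ell = n // (s + 1)                                   # number of blocks
B = np.kron(np.eye(ell, dtype=int), np.ones((s + 1, s + 1), dtype=int))

for survivors in combinations(range(n), n - s):      # every straggler pattern
    S = set(survivors)
    # residue classes i whose workers i, i+(s+1), ..., i+(ell-1)(s+1) all survived
    ok = [i for i in range(s + 1)
          if all(j * (s + 1) + i in S for j in range(ell))]
    assert ok                                        # pigeonhole: at least one
    a = np.zeros(n, dtype=int)
    a[[j * (s + 1) + ok[0] for j in range(ell)]] = 1
    assert (a @ B == np.ones(n, dtype=int)).all()    # exact recovery
```

The s stragglers can each break at most one of the s + 1 residue classes, so one class is always complete — exactly the pigeonhole argument of the slide.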

  17. Larger Example: n = k = 165 and s = 15 ⇒ r = 5 — Binary Scheme
    Do not want a lot of redundancy — close to block diagonal
    17 / 21

  18. Outline for section 3 Introduction and Motivation Gradient Coding Problem Setup Binary Scheme Allocation to Heterogeneous Workers 18 / 21

  19. Setup a Linear System — Allocation to Heterogeneous Workers
    ◮ Assume two groups of different machines T_1, T_2, s.t.: t_i = E[time for a worker of T_i to compute g_j], with t_1 ≠ t_2
    ◮ Goal: the same expected completion time for each worker
    ◮ Let |J_{T_i}| = # of partitions allocated to each of T_i's workers
    ◮ Let |T_i| = τ_i, with τ_1 = (α/β) · τ_2
    Solve the linear system:
    1. t_1 · |J_{T_1}| = t_2 · |J_{T_2}|
    2. |J_{T_1}| · τ_1 + |J_{T_2}| · τ_2 = (s + 1) · k
    3. τ_2 = (β/α) · τ_1
    19 / 21
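Equations 1 and 2 above form a 2×2 linear system in the unknowns |J_{T_1}|, |J_{T_2}| once t_i, τ_i, s, k are fixed. A sketch with made-up numbers (all values below are illustrative, not from the talk):

```python
import numpy as np

def allocation(t1, t2, tau1, tau2, s, k):
    # Solve  t1*J1 - t2*J2 = 0  and  J1*tau1 + J2*tau2 = (s+1)*k  for (J1, J2)
    A = np.array([[t1, -t2], [tau1, tau2]], dtype=float)
    b = np.array([0.0, (s + 1) * k])
    return np.linalg.solve(A, b)

# t1 = 1, t2 = 2: type-1 machines are twice as fast; tau1 = 2, tau2 = 4 workers
J1, J2 = allocation(t1=1, t2=2, tau1=2, tau2=4, s=3, k=16)
# J1 = 16, J2 = 8: the faster workers get twice the partitions,
# so t1*J1 = t2*J2 and every worker has the same expected finish time
```

In general the solution need not be integral, so an actual assignment would round while preserving the total (s + 1)·k.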

  20. Main Takeaways of Our Scheme ◮ Gave a simple gradient coding scheme ◮ Faster online decoding ◮ Numerically stable in encoding and decoding ◮ Works for any pair s , n ◮ Extended it to accommodate heterogeneous workers also 20 / 21

  21. Thank you for your attention!

  22. Outline for section 4 Additional Slides Details of the constructions Explicit Algorithms 22 / 21

  23. Idea Behind Binary Scheme — Details of the constructions
    ◮ When (s + 1) | n and k = n — B is block diagonal
    ◮ each block of (s + 1) workers is assigned the same partitions, repeated over the ℓ = n/(s+1) blocks
    ◮ For (s + 1) ∤ n, each worker within a block of (s + 1) rows corresponds to a distinct congruence class (c.c.) mod (s + 1)
    ◮ When any f = n − s workers send their computations, at least one congruence class is met in every block — pigeonhole
    ◮ ∃ i ∈ Z/(s + 1)Z s.t. i + j(s + 1) ∈ I, for all j = 0, 1, ..., ℓ − 1
    ◮ the received workers of that class "always form a coset"
    ◮ Decoding: select any such i, and sum the vectors received by the workers of c.c. i — a^T = Σ_{j=0}^{ℓ−1} e_{i+j(s+1)}
    ◮ Want an "even" (balanced) number of assignments — homogeneous servers
    23 / 21
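The coset decoding vector a^T = Σ_{j=0}^{ℓ−1} e_{i+j(s+1)} can be written out directly in the block-diagonal case (sizes and the chosen residue i are illustrative):

```python
import numpy as np

n, s, i = 12, 3, 2                   # pick residue class i = 2 (illustrative)
ell = n // (s + 1)                   # number of blocks
B = np.kron(np.eye(ell, dtype=int), np.ones((s + 1, s + 1), dtype=int))

# a^T = sum_j e_{i + j(s+1)}: one representative per block, all of residue i
a = np.zeros(n, dtype=int)
for j in range(ell):
    a[i + j * (s + 1)] = 1           # e_{i + j(s+1)}
assert (a @ B == 1).all()            # a^T B = 1_{1 x k}
```

Since the entries of a and B are all in {0, 1}, this decoding involves no division or matrix inversion, which is the source of the numerical stability claimed earlier.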

  24. Binary Scheme when (s + 1) ∤ n — Details of the constructions
    ◮ Determine the integer parameters:
        n = ℓ · (s + 1) + r,   0 ≤ r < s + 1
        r = t · ℓ + q,   0 ≤ q < ℓ
        n = λ · (ℓ + 1) + r̃,   0 ≤ r̃ < ℓ + 1
    ◮ Define: C_1 := {[i]_{s+1}}_{i=0}^{r−1} and C_2 := {[i]_{s+1}}_{i=r}^{s}
    ◮ workers of C_1 lie in all (ℓ + 1) blocks, while those of C_2 lie in the first ℓ
    ◮ C_1 load: {s + 1, s} if ℓ + r > s, o.w. {λ + 1, λ}
    ◮ C_2 load: {s + t + 2, s + t + 1} if q > 0, o.w. all have s + t + 1
    24 / 21
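The three Euclidean divisions above determine all the integer parameters from n and s alone. A sketch (function name is illustrative), checked against the deck's two examples — n = 11, s = 3 from slide 13 and n = 165, s = 15 ⇒ r = 5 from slide 17:

```python
def scheme_parameters(n, s):
    # n = ell*(s+1) + r,      0 <= r < s+1
    # r = t*ell + q,          0 <= q < ell
    # n = lam*(ell+1) + r~,   0 <= r~ < ell+1
    ell, r = divmod(n, s + 1)
    t, q = divmod(r, ell)
    lam, r_tilde = divmod(n, ell + 1)
    return ell, r, t, q, lam, r_tilde

assert scheme_parameters(11, 3) == (2, 3, 1, 1, 3, 2)
assert scheme_parameters(165, 15)[1] == 5     # the larger example: r = 5
```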
