
CoCoA: Communication-Efficient Coordinate Ascent
Virginia Smith
with Martin Jaggi, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, & Michael I. Jordan
LARGE-SCALE OPTIMIZATION


  1. Mini-batch Limitations
     1. METHODS BEYOND SGD
     2. STALE UPDATES
     3. AVERAGE OVER BATCH SIZE

  2–3. Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA) addresses each:
     1. METHODS BEYOND SGD → use a primal-dual framework
     2. STALE UPDATES → immediately apply local updates
     3. AVERAGE OVER BATCH SIZE → average over K << batch size

  4–9. 1. Primal-Dual Framework (weak duality: P(w) ≥ D(α))

  $$\min_{w \in \mathbb{R}^d} \; P(w) := \frac{\lambda}{2}\,\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i(w^T x_i)$$

  $$\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) := -\frac{\lambda}{2}\,\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i), \qquad A_i = \frac{1}{\lambda n}\, x_i$$

     Stopping criterion given by the duality gap
     Good performance in practice
     Default in software packages, e.g. liblinear
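To make the duality-gap stopping criterion concrete, here is a minimal NumPy sketch (my own illustration, not code from the talk) for the squared loss ℓ_i(a) = ½(a − y_i)², whose conjugate gives ℓ*_i(−α_i) = ½α_i² − α_i y_i:

```python
import numpy as np

def primal(w, X, y, lam):
    # P(w) = (lam/2)||w||^2 + (1/n) sum_i 0.5*(w^T x_i - y_i)^2
    return 0.5 * lam * (w @ w) + 0.5 * np.mean((X @ w - y) ** 2)

def dual(alpha, X, y, lam):
    # D(alpha) = -(lam/2)||A alpha||^2 - (1/n) sum_i (0.5*alpha_i^2 - alpha_i*y_i),
    # where A_i = x_i / (lam*n), so A alpha = w(alpha)
    n = len(y)
    w = X.T @ alpha / (lam * n)
    return -0.5 * lam * (w @ w) - np.mean(0.5 * alpha ** 2 - alpha * y)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 5)), rng.standard_normal(100)
lam, alpha = 0.1, rng.standard_normal(100)
w = X.T @ alpha / (lam * len(y))     # map the dual point to a primal point
gap = primal(w, X, y, lam) - dual(alpha, X, y, lam)
assert gap >= 0                      # weak duality: P(w) >= D(alpha)
print(f"duality gap: {gap:.4f}")     # stop when below tolerance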

  10–12. 2. Immediately Apply Updates

  STALE (mini-batch): every update in the batch is computed against the old w, then the batch is applied at once
     for i ∈ b
        Δw ← Δw − α ∇_i P(w)
     end
     w ← w + Δw

  FRESH (CoCoA): each update is applied as soon as it is computed, so later steps see it; Δw still accumulates the total change to communicate
     for i ∈ b
        δ ← −α ∇_i P(w)
        Δw ← Δw + δ
        w ← w + δ
     end

  (A toy contrast in code follows below.)
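The difference is easiest to see in code. A toy NumPy sketch (mine, with a hypothetical per-coordinate gradient for the squared loss, not the talk's implementation):

```python
import numpy as np

def grad_i(w, X, y, lam, i):
    # hypothetical coordinate-i gradient of P(w) for the squared loss
    return lam * w[i] + X[:, i] @ (X @ w - y) / len(y)

def stale_step(w, X, y, lam, batch, step):
    # STALE: every update in the batch sees the same old w
    dw = np.zeros_like(w)
    for i in batch:
        dw[i] -= step * grad_i(w, X, y, lam, i)
    return w + dw

def fresh_step(w, X, y, lam, batch, step):
    # FRESH: each update is applied immediately, so later
    # coordinates see the effect of earlier ones
    w = w.copy()
    for i in batch:
        w[i] -= step * grad_i(w, X, y, lam, i)
    return w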

  13–14. 3. Average over K

  reduce: $w \leftarrow w + \frac{1}{K}\sum_k \Delta w_k$
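The point of the reduce step is that the divisor is the number of machines K, not the much larger batch size, so the aggregate step stays large. A minimal sketch of the reduce, with made-up per-machine updates:

```python
import numpy as np

# hypothetical updates Delta w_k returned by K = 4 machines
deltas = [np.array([0.2, -0.1]), np.array([0.3, 0.0]),
          np.array([0.1, -0.2]), np.array([0.2, -0.1])]
K = len(deltas)
w = np.zeros(2)
w = w + sum(deltas) / K   # reduce: w <- w + (1/K) * sum_k Delta w_k
print(w)                  # [ 0.2 -0.1]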

  15–19. CoCoA

  Algorithm 1: CoCoA
     Input: T ≥ 1, scaling parameter 1 ≤ β_K ≤ K (default: β_K := 1)
     Data: {(x_i, y_i)}_{i=1}^n distributed over K machines
     Initialize: α^(0)_[k] ← 0 for all machines k, and w^(0) ← 0
     for t = 1, 2, ..., T
        for all machines k = 1, 2, ..., K in parallel
           (Δα_[k], Δw_k) ← LocalDualMethod(α^(t−1)_[k], w^(t−1))
           α^(t)_[k] ← α^(t−1)_[k] + (β_K / K) Δα_[k]
        end
        reduce: w^(t) ← w^(t−1) + (β_K / K) Σ_{k=1}^K Δw_k
     end

  Procedure A: LocalDualMethod (dual algorithm on machine k)
     Input: local α_[k] ∈ R^{n_k}, and w ∈ R^d consistent with the other coordinate blocks of α, s.t. w = Aα
     Data: local {(x_i, y_i)}_{i=1}^{n_k}
     Output: Δα_[k] and Δw := A_[k] Δα_[k]

  ✔ <10 lines of code in Spark
  ✔ the primal-dual framework allows any internal optimization method
  ✔ local updates applied immediately
  ✔ average over K

  (A runnable sketch follows.)
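A runnable single-process sketch of the algorithm (my own NumPy illustration with SDCA on the squared loss as LocalDualMethod; the real implementation runs on Spark, and the function names here are hypothetical):

```python
import numpy as np

def local_sdca(alpha_k, w, Xk, yk, lam, n, H, rng):
    """LocalDualMethod: H SDCA steps on machine k's coordinate block
    (squared loss). Returns (Delta alpha_[k], Delta w = A_[k] Delta alpha_[k])."""
    alpha_k, w = alpha_k.copy(), w.copy()
    d_alpha, d_w = np.zeros_like(alpha_k), np.zeros_like(w)
    for _ in range(H):
        i = rng.integers(len(yk))
        x = Xk[i]
        # closed-form coordinate maximization of the dual for the squared loss
        da = (yk[i] - x @ w - alpha_k[i]) / (1.0 + x @ x / (lam * n))
        alpha_k[i] += da
        d_alpha[i] += da
        upd = da * x / (lam * n)
        w += upd                 # fresh: applied immediately to the local w
        d_w += upd               # accumulated change to communicate
    return d_alpha, d_w

def cocoa(blocks, lam, T, H, beta_K=1.0, seed=0):
    """Outer loop. blocks = [(X_1, y_1), ..., (X_K, y_K)], one block per machine."""
    rng = np.random.default_rng(seed)
    K = len(blocks)
    n = sum(len(yk) for _, yk in blocks)
    alphas = [np.zeros(len(yk)) for _, yk in blocks]
    w = np.zeros(blocks[0][0].shape[1])
    for _ in range(T):
        results = [local_sdca(a, w, Xk, yk, lam, n, H, rng)
                   for a, (Xk, yk) in zip(alphas, blocks)]      # "in parallel"
        for k, (d_alpha, _) in enumerate(results):
            alphas[k] += (beta_K / K) * d_alpha
        w = w + (beta_K / K) * sum(d_w for _, d_w in results)   # reduce
    return w, alphas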

  20–26. Convergence

  Assumptions:
     the ℓ_i are (1/γ)-smooth
     LocalDualMethod makes improvement Θ per step; e.g. for SDCA with H local iterations (ñ := n/K, the local dataset size),
     $$\Theta = \left(1 - \frac{\lambda n \gamma}{(1 + \lambda n \gamma)\,\tilde{n}}\right)^{H}$$

  Theorem:
  $$\mathbb{E}\big[D(\alpha^*) - D(\alpha^{(T)})\big] \;\le\; \left(1 - (1-\Theta)\,\frac{1}{K}\,\frac{\lambda n \gamma}{\sigma + \lambda n \gamma}\right)^{T} \big(D(\alpha^*) - D(\alpha^{(0)})\big)$$

     The bound applies also to the duality gap
     0 ≤ σ ≤ n/K is a measure of the difficulty of the data partition

  ✔ it converges!
  ✔ inherits the convergence rate of the locally used method
  ✔ convergence rate is linear for smooth losses
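One way to unpack the theorem (my own algebra, using only the bound above and $(1-c)^T \le e^{-cT}$): the number of outer communication rounds needed for ε dual suboptimality grows linearly in K, only logarithmically in 1/ε, and shrinks as the local method improves (Θ → 0):

$$T \;\ge\; \frac{1}{1-\Theta}\cdot\frac{K\,(\sigma + \lambda n \gamma)}{\lambda n \gamma}\cdot \log\frac{D(\alpha^*) - D(\alpha^{(0)})}{\varepsilon} \quad\Longrightarrow\quad \mathbb{E}\big[D(\alpha^*) - D(\alpha^{(T)})\big] \;\le\; \varepsilon .$$

Increasing the local work H drives Θ toward 0, trading more computation per round for fewer communication rounds; this is exactly the trade-off the Effect-of-H experiment below explores.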

  27–28. LARGE-SCALE OPTIMIZATION with CoCoA: RESULTS!

  29. Empirical Results

     Dataset    Training (n)   Features (d)   Sparsity   Workers (K)
     Cov        522,911        54             22.22%     4
     Rcv1       677,399        47,236         0.16%      8
     Imagenet   32,751         160,000        100%       32

  30. [Figure: Imagenet. Log primal suboptimality vs. time (s) for CoCoA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), and mini-batch-SGD (H=10).]

  31. [Figure: three panels, log primal suboptimality vs. time (s). RCV1: CoCoA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e4), batch-SGD (H=100). Cov: CoCoA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e5), batch-SGD (H=1). Imagenet: CoCoA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), mini-batch-SGD (H=10).]

  32. [Figure: Effect of H on CoCoA. Log primal suboptimality vs. time (s) for H ∈ {1, 100, 1e3, 1e4, 1e5}.]

  33. CoCoA Take-Aways
