HIGHLY PARALLEL METHODS FOR MACHINE LEARNING AND SIGNAL RECOVERY
Tom Goldstein
TOPICS
• Introduction
• ADMM / Fast ADMM
• Application: Distributed computing
• Automation & Adaptivity
FIRST-ORDER METHODS

minimize_x F(x)

Generalizes gradient descent:  x^{k+1} = x^k − τ ∇F(x^k)

[Figure: gradient-descent iterates x^0, x^1, …, x^4 on a contour plot]

Pros
• Linear complexity
• Parallelizable
• Low memory requirements

Con
• Poor convergence rates

Solution: adaptivity and acceleration
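A minimal sketch of the gradient-descent iteration above (illustrative only; the step size τ and the quadratic test objective are assumptions, not taken from the slides):

```python
import numpy as np

def gradient_descent(grad_F, x0, tau, iters=100):
    """Plain first-order iteration: x^{k+1} = x^k - tau * grad_F(x^k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - tau * grad_F(x)
    return x

# Example: F(x) = 0.5 x^T Q x - b^T x, so grad_F(x) = Q x - b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = gradient_descent(lambda x: Q @ x - b, x0=np.zeros(2), tau=0.2, iters=500)
```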
CONSTRAINED PROBLEMS

minimize_{u,v} H(u) + G(v)  subject to  Au + Bv = b

Big idea: Lagrange multipliers

max_λ min_{u,v}  H(u) + G(v) + ⟨λ, b − Au − Bv⟩ + (τ/2)‖b − Au − Bv‖²

• Optimality for λ:  b − Au − Bv = 0
• Reduced energy:  H(u) + G(v)
• Saddle point = solution to the constrained problem
ADMM

minimize_{u,v} H(u) + G(v)  subject to  Au + Bv = b

Big idea: Lagrange multipliers

max_λ min_{u,v}  H(u) + G(v) + ⟨λ, b − Au − Bv⟩ + (τ/2)‖b − Au − Bv‖²

Alternating Direction Method of Multipliers
u^{k+1} = argmin_u  H(u) + ⟨λ^k, −Au⟩ + (τ/2)‖b − Au − Bv^k‖²
v^{k+1} = argmin_v  G(v) + ⟨λ^k, −Bv⟩ + (τ/2)‖b − Au^{k+1} − Bv‖²
λ^{k+1} = λ^k + τ(b − Au^{k+1} − Bv^{k+1})
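A minimal sketch of the generic ADMM iteration above (not the author's code). The subproblem solvers are passed in as functions and are assumed to return exact minimizers of the two argmin steps; the toy quadratic usage at the bottom is a made-up illustration with closed-form subproblems:

```python
import numpy as np

def admm(solve_u, solve_v, A, B, b, tau=1.0, iters=100):
    """Generic ADMM for  min H(u) + G(v)  s.t.  Au + Bv = b."""
    v = np.zeros(B.shape[1])
    lam = np.zeros(b.shape[0])
    for _ in range(iters):
        u = solve_u(lam, v)                       # u-subproblem
        v = solve_v(lam, u)                       # v-subproblem
        lam = lam + tau * (b - A @ u - B @ v)     # dual ascent step
    return u, v, lam

# Toy usage (hypothetical): H(u) = 0.5||u||^2, G(v) = 0.5||v||^2, A = B = I, tau = 1.
n = 5
I = np.eye(n)
b_vec = np.arange(n, dtype=float)
solve_u = lambda lam, v: (lam + (b_vec - v)) / 2.0   # closed-form minimizer of the u-step
solve_v = lambda lam, u: (lam + (b_vec - u)) / 2.0   # closed-form minimizer of the v-step
u, v, lam = admm(solve_u, solve_v, I, I, b_vec, tau=1.0, iters=50)
```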
EXAMPLE PROBLEMS

Non-Smooth Problems
• TV denoising:  min |∇u| + (µ/2)‖u − f‖²   (|∇u| is the total variation)
• TV deblurring:  min |∇u| + (µ/2)‖Ku − f‖²   (K a convolution operator)
• General problem:  min |∇u| + (µ/2)‖Au − f‖²

[Figures: noisy image vs. clean image for TV denoising; blurred image for TV deblurring]

Goldstein & Osher, "Split Bregman," 2009
WHY IS SPLITTING GOOD?

Non-smooth problem:  min |∇u| + (µ/2)‖Au − f‖²

• Make a change of variables: v ← ∇u
• "Split Bregman" form:
  minimize  |v| + (µ/2)‖Au − f‖²   subject to  v − ∇u = 0
• Augmented Lagrangian:
  |v| + (µ/2)‖Au − f‖² + ⟨λ, v − ∇u⟩ + (τ/2)‖v − ∇u‖²

ADMM for TV
u^{k+1} = argmin_u  (µ/2)‖Au − f‖² + (τ/2)‖v^k − ∇u − λ^k‖²
v^{k+1} = argmin_v  |v| + (τ/2)‖v − ∇u^{k+1} − λ^k‖²
λ^{k+1} = λ^k + τ(∇u^{k+1} − v^{k+1})

Goldstein & Osher, "Split Bregman," 2009
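A minimal 1D sketch of the TV-denoising case (A = I) of the iteration above, written in the standard scaled-dual form of Split Bregman/ADMM; the forward-difference operator D and the parameter defaults are illustrative assumptions:

```python
import numpy as np

def tv_denoise_1d(f, mu=10.0, tau=1.0, iters=200):
    """Split Bregman / ADMM for 1D TV denoising:  min |Du|_1 + (mu/2)||u - f||^2."""
    n = len(f)
    D = np.diff(np.eye(n), axis=0)            # forward-difference operator, (n-1) x n
    M = mu * np.eye(n) + tau * D.T @ D        # fixed matrix of the quadratic u-subproblem
    u = np.asarray(f, dtype=float).copy()
    v = np.zeros(n - 1)                       # splitting variable, v ~ Du
    lam = np.zeros(n - 1)                     # scaled dual variable
    shrink = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
    for _ in range(iters):
        u = np.linalg.solve(M, mu * u * 0 + mu * f + tau * D.T @ (v - lam))  # u-subproblem (linear solve)
        v = shrink(D @ u + lam, 1.0 / tau)                                    # v-subproblem (soft-thresholding)
        lam = lam + D @ u - v                                                 # scaled dual update
    return u
```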
WHY IS SPLITTING BAD?

TV denoising:  min |∇u| + (µ/2)‖u − f‖²
GRADIENT VS. NESTEROV

Gradient descent: rate O(1/k)
Nesterov's method: rate O(1/k²), which is optimal for first-order methods (Nemirovski and Yudin '83)

[Figure: iterate paths x^0, …, x^4 for gradient descent vs. Nesterov's method]
NESTEROV'S METHOD

minimize_x F(x)

Gradient descent step:  x^{k+1} = y^k − τ ∇F(y^k)
Acceleration factor:  α^{k+1} = (1 + √(1 + 4(α^k)²)) / 2
Prediction:  y^{k+1} = x^{k+1} + ((α^k − 1)/α^{k+1}) (x^{k+1} − x^k)

Nesterov '83
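A minimal sketch of the three update rules above (illustrative; the step size and the quadratic test objective are assumptions):

```python
import numpy as np

def nesterov(grad_F, x0, tau, iters=100):
    """Nesterov's accelerated gradient method."""
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    alpha = 1.0
    for _ in range(iters):
        x_new = y - tau * grad_F(y)                                # gradient step at predicted point
        alpha_new = (1.0 + np.sqrt(1.0 + 4.0 * alpha**2)) / 2.0    # acceleration factor
        y = x_new + (alpha - 1.0) / alpha_new * (x_new - x)        # prediction (momentum) step
        x, alpha = x_new, alpha_new
    return x

# Example: same quadratic objective as before, grad_F(x) = Q x - b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = nesterov(lambda x: Q @ x - b, x0=np.zeros(2), tau=0.2, iters=200)
```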
ACCELERATED SPLITTING METHODS
HOW TO MEASURE CONVERGENCE?

There is no "objective" to minimize.

[Figure: an unconstrained problem (convex bowl) vs. a constrained problem (saddle surface)]
RESIDUALS

minimize H(u) + G(v)  subject to  Au + Bv = b

• Lagrangian:  min_{u,v} max_λ  H(u) + G(v) + ⟨λ, b − Au − Bv⟩
• Derivative for λ:  b − Au − Bv = 0
• Derivative for u:  ∂H(u) − Aᵀλ = 0
• We have convergence when these derivatives are "small":
  primal residual  r^k = b − Au^k − Bv^k
  dual residual  d^k = ∂H(u^k) − Aᵀλ^k, which for the ADMM iterates reduces to  d^k = τAᵀB(v^k − v^{k−1})
EXPLICIT RESIDUALS

• Explicit formulas for the residuals:
  r^k = b − Au^k − Bv^k
  d^k = τAᵀB(v^k − v^{k−1})
• Combined residual:  c^k = ‖r^k‖² + (1/τ)‖d^k‖²
• ADMM/AMA converge at rate  c^k ≤ O(1/k)
• Goal:  O(1/k²)

Goldstein, O'Donoghue, Setzer, Baraniuk 2012; Yuan and He 2012
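A small helper that evaluates these formulas, e.g. as a stopping test inside an ADMM loop (a sketch; the 1/τ weighting follows the combined-residual definition above):

```python
import numpy as np

def combined_residual(A, B, b, u, v, v_prev, tau):
    """c^k = ||r^k||^2 + (1/tau)||d^k||^2 using the explicit residual formulas above."""
    r = b - A @ u - B @ v                  # primal residual r^k
    d = tau * (A.T @ (B @ (v - v_prev)))   # dual residual d^k
    return r @ r + (d @ d) / tau
```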
FAST ADMM

Require: v^{−1} = v̂^0 ∈ R^{N_v},  λ^{−1} = λ̂^0 ∈ R^{N_b},  τ > 0
1: for k = 1, 2, 3, … do
2:   u^k = argmin_u  H(u) + ⟨λ̂^k, −Au⟩ + (τ/2)‖b − Au − Bv̂^k‖²
3:   v^k = argmin_v  G(v) + ⟨λ̂^k, −Bv⟩ + (τ/2)‖b − Au^k − Bv‖²
4:   λ^k = λ̂^k + τ(b − Au^k − Bv^k)
5:   α^{k+1} = (1 + √(1 + 4(α^k)²)) / 2
6:   v̂^{k+1} = v^k + ((α^k − 1)/α^{k+1})(v^k − v^{k−1})
7:   λ̂^{k+1} = λ^k + ((α^k − 1)/α^{k+1})(λ^k − λ^{k−1})
8: end for

Goldstein, O'Donoghue, Setzer, Baraniuk 2012
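A minimal sketch of the Fast ADMM loop above, with the Nesterov-style over-relaxation applied to v and λ. As in the plain ADMM sketch earlier, the subproblem solvers are passed in as functions and assumed exact; the restart safeguard discussed in the paper is omitted here:

```python
import numpy as np

def fast_admm(solve_u, solve_v, A, B, b, tau=1.0, iters=100):
    """Accelerated ADMM for  min H(u) + G(v)  s.t.  Au + Bv = b."""
    v = v_hat = np.zeros(B.shape[1])
    lam = lam_hat = np.zeros(b.shape[0])
    alpha = 1.0
    for _ in range(iters):
        u = solve_u(lam_hat, v_hat)                                # u-subproblem at predicted point
        v_new = solve_v(lam_hat, u)                                # v-subproblem
        lam_new = lam_hat + tau * (b - A @ u - B @ v_new)          # dual update
        alpha_new = (1.0 + np.sqrt(1.0 + 4.0 * alpha**2)) / 2.0    # acceleration factor
        v_hat = v_new + (alpha - 1.0) / alpha_new * (v_new - v)              # predicted v
        lam_hat = lam_new + (alpha - 1.0) / alpha_new * (lam_new - lam)      # predicted lambda
        v, lam, alpha = v_new, lam_new, alpha_new
    return u, v, lam
```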