

  1. Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions. Alekh Agarwal, Microsoft Research. Joint work with Sahand Negahban and Martin Wainwright. Workshop on Optimization and Statistical Learning 2013, Les Houches, France.

  2. Introduction. Sparse optimization:
  θ∗ = arg min_{θ ∈ ℝ^d} E_P[ℓ(θ; z)] = arg min_θ L(θ), such that θ∗ is s-sparse.
  The loss function ℓ is convex; P is unknown, but we can sample from it. High-dimensional setup: n ≪ d. We want a linear-time and statistically (near-)optimal algorithm.

  5. Example 1: Computational genomics.
  [Figure: y = sign(Xθ∗), where X is the n × d design matrix of genome sequences (A/C/G/T) and y ∈ {S, C}^n are disease labels.]
  Predict disease susceptibility from the genome. Susceptibility depends on very few genes, so θ∗ is sparse.
  Sparse logistic regression: θ∗ = arg min_θ E_P[log(1 + exp(−y θ^T x))].
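The per-sample logistic loss above and its gradient can be sketched directly; this is an illustrative implementation, not code from the talk:

```python
import numpy as np

def logistic_loss(theta, x, y):
    """ell(theta; (x, y)) = log(1 + exp(-y * theta @ x)), labels y in {-1, +1}."""
    return np.logaddexp(0.0, -y * (theta @ x))

def logistic_grad(theta, x, y):
    """Gradient of the logistic loss with respect to theta."""
    margin = y * (theta @ x)
    return -y * x / (1.0 + np.exp(margin))
```

The `logaddexp` form avoids overflow for large negative margins; stochastic methods later in the talk only need `logistic_grad`.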

  7. Example 2: Compressed sensing.
  [Figure: y = Xθ∗ + w, with X an n × d measurement matrix and w noise.]
  Recover the unknown signal θ∗ from noisy measurements.
  Sparse linear regression: θ∗ = arg min_θ E_P[(y − θ^T x)²].
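The measurement model y = Xθ∗ + w can be simulated in a few lines; the dimensions and noise level here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 100, 500, 5          # n measurements, ambient dimension d, sparsity s

# s-sparse signal theta_star with random support and values
theta_star = np.zeros(d)
theta_star[rng.choice(d, s, replace=False)] = rng.standard_normal(s)

# Random design X and noisy measurements y = X theta_star + w
X = rng.standard_normal((n, d))
w = 0.1 * rng.standard_normal(n)
y = X @ theta_star + w
```

Note n ≪ d, so recovering θ∗ from (X, y) is only possible because of sparsity.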

  8. Approach 1: M-estimation (batch optimization).
  Draw n i.i.d. samples and obtain
  θ̂_n = arg min_θ (1/n) Σ_{i=1}^n ℓ(θ; z_i) + λ_n ‖θ‖₁.
  Statistical arguments give consistency, θ̂_n → θ∗. Convex optimization computes θ̂_n.
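For the least-squares instance of this M-estimator, a minimal batch solver is proximal gradient descent (ISTA); the step size and iteration count below are illustrative choices, not from the slides:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (coordinatewise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize (1/(2n)) ||y - X theta||_2^2 + lam * ||theta||_1."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    theta = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n   # gradient of the smooth part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```

Each iteration touches all n samples, which is exactly the O(nd) per-iteration cost discussed on the complexity slides below.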

  10. Batch optimization.
  Convergence depends on properties of (1/n) Σ_{i=1}^n ℓ(θ; z_i) + λ_n ‖θ‖₁.
  The sample loss is not (globally) strongly convex for n < d, and has poor smoothness when n ≪ d.
  [Figure: surface plot of a sample loss that is flat along some directions.]
  But the loss is smooth and strongly convex in sparse directions. Example: least-squares loss with random design.

  12. Fast convergence of gradient descent.
  We prove (global) linear convergence of gradient descent, at a rate governed by the sparse condition number of (1/n) Σ_{i=1}^n ℓ(θ; z_i).
  [Figure: log ‖θ^t − θ̂‖ (rescaled) vs. iteration count, for n = 2500 and p ∈ {5000, 10000, 20000}; second panel with α = 16.3069.]

  14. Computational complexity of batch optimization.
  The convergence rate captures the number of iterations. Each iteration has complexity O(nd): one pass over the data per iteration.
  But we wanted a linear-time algorithm!

  16. Approach 2: Stochastic optimization.
  Directly minimize E_P[ℓ(θ; z)], using samples to obtain gradient estimates:
  θ_{t+1} = θ_t − α_t ∇ℓ(θ_t; z_t).
  Stop after one pass over the data. Statistically, this is often competitive with batch (that is, ‖θ_n − θ∗‖₂ ≈ ‖θ̂_n − θ∗‖₂); precise rates depend on the problem structure.
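A one-pass SGD loop for the least-squares loss makes the update concrete; the 1/√t step-size schedule is an illustrative choice:

```python
import numpy as np

def sgd_one_pass(samples, d, alpha0=0.1):
    """theta_{t+1} = theta_t - alpha_t * grad ell(theta_t; z_t), one pass over samples.

    Uses ell(theta; (x, y)) = (y - theta @ x)^2 / 2, whose gradient in theta
    is -(y - theta @ x) * x, with step size alpha_t = alpha0 / sqrt(t).
    """
    theta = np.zeros(d)
    for t, (x, y) in enumerate(samples, start=1):
        grad = -(y - theta @ x) * x
        theta = theta - (alpha0 / np.sqrt(t)) * grad
    return theta
```

Each update costs O(d), so the total cost of one pass is O(nd): linear time in the data size, unlike the batch approach.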

  18. Structural assumptions.
  θ∗ is s-sparse. We make additional structural assumptions on L(θ) = E_P[ℓ(θ; z)]: L is locally Lipschitz, and L is locally strongly convex (LSC).

  19. Locally Lipschitz functions.
  Definition (locally Lipschitz function): L is locally G-Lipschitz in ℓ₁-norm, meaning that
  |L(θ) − L(θ̃)| ≤ G ‖θ − θ̃‖₁ whenever ‖θ − θ∗‖₁ ≤ R and ‖θ̃ − θ∗‖₁ ≤ R.
  [Figure: globally Lipschitz vs. locally Lipschitz functions.]

  20. Locally strongly convex functions.
  Definition (locally strongly convex function): there is a constant γ > 0 such that
  L(θ̃) ≥ L(θ) + ⟨∇L(θ), θ̃ − θ⟩ + (γ/2) ‖θ − θ̃‖₂² whenever ‖θ‖₁ ≤ R and ‖θ̃‖₁ ≤ R.
  [Figure: locally strongly convex vs. globally strongly convex functions.]

  21. Stochastic optimization and structural conditions.
  Method                                  | Sparsity | LSC | Convergence
  SGD                                     | no       | yes | O(d / T)
  Mirror descent / RDA / FOBOS / COMID    | yes      | no  | O(√(s² log d / T))
  Our method                              | yes      | yes | O(s log d / T)

  22. Some previous methods.
  All methods are based on observing g_t such that E[g_t] ∈ ∂L(θ_t).
  Stochastic gradient descent: based on ℓ₂ distances, exploits LSC:
  θ_{t+1} = arg min_θ ⟨g_t, θ⟩ + (1/(2α_t)) ‖θ − θ_t‖₂².
  Stochastic dual averaging: based on ℓ_p distances, exploits sparsity when p ≈ 1:
  θ_{t+1} = arg min_θ ⟨Σ_{s=1}^t g_s, θ⟩ + (1/(2α_t)) ‖θ‖_p².
  We need to reconcile the two geometries to exploit both structures.
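The dual-averaging step with the ℓ_p prox term has a closed form through the conjugate norm, with 1/p + 1/q = 1. A sketch, with the constant step size α as an illustrative simplification:

```python
import numpy as np

def pnorm_link(mu, alpha, p):
    """Minimizer over theta of <mu, theta> + ||theta||_p^2 / (2 * alpha).

    Equals -alpha times the gradient of the conjugate (1/2) ||mu||_q^2,
    where 1/p + 1/q = 1.
    """
    q = p / (p - 1.0)
    norm_q = np.linalg.norm(mu, ord=q)
    if norm_q == 0.0:
        return np.zeros_like(mu)
    return -alpha * norm_q ** (2.0 - q) * np.sign(mu) * np.abs(mu) ** (q - 1.0)

def dual_averaging(zs, grad_fn, d, alpha=0.1):
    """theta_{t+1} = argmin_theta <sum_s g_s, theta> + ||theta||_p^2 / (2 alpha)."""
    p = 2.0 * np.log(d) / (2.0 * np.log(d) - 1.0)   # p ~ 1, as on the slides
    mu = np.zeros(d)
    theta = np.zeros(d)
    for z in zs:
        mu = mu + grad_fn(theta, z)   # grad_fn returns a stochastic (sub)gradient
        theta = pnorm_link(mu, alpha, p)
    return theta
```

With q = p/(p − 1) large, the exponent q − 1 in the link function concentrates mass on the largest coordinates of μ, which is how the ℓ_p geometry favors sparse iterates.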

  24. RADAR algorithm: outline.
  Based on Juditsky and Nesterov (2011). Recall the minimization problem: min_θ E[ℓ(θ; z)].
  The algorithm proceeds over K epochs. At epoch i, solve the regularized problem
  min_{θ ∈ Ω_i} E[ℓ(θ; z)] + λ_i ‖θ‖₁, where Ω_i = {θ ∈ ℝ^d : ‖θ − y_i‖_p² ≤ R_i²}.

  25. RADAR algorithm: first epoch.
  Require: R₁ such that ‖θ∗‖₁ ≤ R₁.
  Perform stochastic dual averaging with p = 2 log d / (2 log d − 1) ≈ 1.
  Initialize θ₁ = 0, y₁ = 0.
  Observe g_t where E[g_t] ∈ ∂L(θ_t), and ν_t ∈ ∂‖θ_t‖₁.
  Update:
  μ_{t+1} = μ_t + g_t + λ₁ ν_t,
  θ_{t+1} = arg min_{‖θ‖_p ≤ R₁} ⟨θ, μ_{t+1}⟩ + (1/(2α_t)) ‖θ‖_p².
  [Figure: ℓ_p-ball of radius R₁ around y₁ = 0, containing θ∗.]

  30. Initializing the next epoch.
  Update y₂ = θ̄_T (the average iterate), R₂² = R₁² / 2, λ₂ = λ₁ / √2. Initialize θ₁ = y₂ for the next epoch.
  Now use the updates:
  μ_{t+1} = μ_t + g_t + λ₂ ν_t,
  θ_{t+1} = arg min_{‖θ − y₂‖_p ≤ R₂} ⟨θ, μ_{t+1}⟩ + (1/(2α_t)) ‖θ − y₂‖_p².
  Each step is still O(d).
  [Figure: smaller ℓ_p-ball of radius R₂ around y₂, still containing θ∗.]
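Putting the epochs together, the procedure can be sketched compactly. A constant step size α, equal epoch lengths, and solving the constrained ℓ_p step by rescaling the unconstrained minimizer onto the ball are simplifying assumptions of this sketch, not details from the slides:

```python
import numpy as np

def pnorm_link(mu, alpha, p):
    """Unconstrained minimizer of <mu, delta> + ||delta||_p^2 / (2 alpha)."""
    q = p / (p - 1.0)
    norm_q = np.linalg.norm(mu, ord=q)
    if norm_q == 0.0:
        return np.zeros_like(mu)
    return -alpha * norm_q ** (2.0 - q) * np.sign(mu) * np.abs(mu) ** (q - 1.0)

def radar(zs, grad_fn, d, R1, lam1, n_epochs=4, alpha=0.1):
    """Multi-epoch dual averaging: shrink the l_p ball and lambda each epoch."""
    p = 2.0 * np.log(d) / (2.0 * np.log(d) - 1.0)
    y, R, lam = np.zeros(d), R1, lam1
    T = len(zs) // n_epochs
    k = 0
    for _ in range(n_epochs):
        mu = np.zeros(d)
        theta = y.copy()
        avg = np.zeros(d)
        for _ in range(T):
            g = grad_fn(theta, zs[k]); k += 1
            nu = np.sign(theta)                 # nu_t in the subdifferential of ||theta_t||_1
            mu = mu + g + lam * nu
            delta = pnorm_link(mu, alpha, p)
            r = np.linalg.norm(delta, ord=p)
            if r > R:                           # constrained minimizer: rescale onto the ball
                delta = delta * (R / r)
            theta = y + delta
            avg = avg + theta
        y = avg / T                             # y_{i+1} = average iterate of epoch i
        R = R / np.sqrt(2.0)                    # R_{i+1}^2 = R_i^2 / 2
        lam = lam / np.sqrt(2.0)                # lambda_{i+1} = lambda_i / sqrt(2)
    return y
```

Every inner step is O(d), and the samples are consumed in a single pass split across epochs, which is the source of the overall linear-time guarantee.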
