Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions
Alekh Agarwal, Microsoft Research
Joint work with Sahand Negahban and Martin Wainwright
Workshop on Optimization and Statistical Learning 2013, Les Houches, France
Introduction

Sparse optimization:
    θ* = arg min_{θ ∈ R^d} E_P[ℓ(θ; z)] = arg min_θ L(θ),   such that θ* is s-sparse

Loss function ℓ is convex
P unknown, but can sample from it
High-dimensional setup: n ≪ d
Want a linear-time and statistically (near-)optimal algorithm
Example 1: Computational genomics

[Figure: n × d design matrix X of genome sequences; labels y = sign(X θ*)]

Predict disease susceptibility from the genome
Susceptibility depends on very few genes, so θ* is sparse
Sparse logistic regression:
    θ* = arg min_θ E_P[log(1 + exp(−y θ^T x))]
Example 2: Compressed sensing

[Figure: measurement model y = X θ* + w, with X an n × d matrix and noise w]

Recover unknown signal θ* from noisy measurements
Sparse linear regression:
    θ* = arg min_θ E_P[(y − θ^T x)^2]
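The compressed-sensing setup is easy to simulate. A minimal sketch (dimensions, sparsity level, and noise scale are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 200, 1000, 10              # high-dimensional regime: n << d

# s-sparse ground-truth signal theta*
theta_star = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
theta_star[support] = rng.normal(size=s)

# noisy linear measurements y = X theta* + w
X = rng.normal(size=(n, d))
w = 0.1 * rng.normal(size=n)
y = X @ theta_star + w
```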
Approach 1: M-estimation (batch optimization)

Draw n i.i.d. samples, obtain θ̂_n:
    θ̂_n = arg min_θ (1/n) Σ_{i=1}^n ℓ(θ; z_i) + λ_n ‖θ‖_1

Statistical arguments for consistency: θ̂_n → θ*
Convex optimization to compute θ̂_n
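For the least-squares loss, the batch estimator θ̂_n can be computed by proximal gradient descent (ISTA). A hedged sketch, not the authors' code; the step size and iteration count are illustrative:

```python
import numpy as np

def soft_threshold(v, tau):
    # prox operator of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_batch(X, y, lam, n_iter=500):
    # ISTA on (1/2n)||y - X theta||_2^2 + lam * ||theta||_1
    n, d = X.shape
    L = np.linalg.norm(X, ord=2) ** 2 / n    # Lipschitz constant of the gradient
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n     # gradient of the smooth part
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta
```

Note that each gradient evaluation touches all n samples, which is the O(nd) per-iteration cost discussed below.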
Batch optimization

Convergence depends on properties of
    (1/n) Σ_{i=1}^n ℓ(θ; z_i) + λ_n ‖θ‖_1

Sample loss not (globally) strongly convex for n < d
Poor smoothness when n ≪ d

[Figure: surface plot of a sample loss that is flat along some directions]

But smooth and strongly convex in sparse directions
Example: least-squares loss with random design
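This can be checked numerically for least squares with a Gaussian random design: when n < d the full sample Hessian is singular, yet its restriction to any small support is well-conditioned with high probability. A small sketch (dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, s = 100, 400, 5
X = rng.normal(size=(n, d))
H = X.T @ X / n                  # Hessian of the least-squares sample loss

# Globally: rank(H) <= n < d, so the smallest eigenvalue is 0
# (no global strong convexity).
eigs = np.linalg.eigvalsh(H)

# Restricted to a random s-subset, the s x s sub-Hessian is
# bounded away from singular with high probability.
S = rng.choice(d, size=s, replace=False)
sub_eigs = np.linalg.eigvalsh(H[np.ix_(S, S)])
```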
Fast convergence of gradient descent

We prove (global) linear convergence of gradient descent, at a rate governed by the
sparse condition number of (1/n) Σ_{i=1}^n ℓ(θ; z_i)

[Figure: log ‖θ^t − θ̂‖ (rescaled) vs. iteration count, for n = 2500 and
p = 5000, 10000, 20000: the error decays linearly on the log scale]
Computational complexity of batch optimization

Convergence rate captures the number of iterations
Each iteration has complexity O(nd): one pass over the data per iteration
But we wanted a linear-time algorithm!
Approach 2: Stochastic optimization

Directly minimize E_P[ℓ(θ; z)]
Use samples to obtain gradient estimates:
    θ^{t+1} = θ^t − α_t ∇ℓ(θ^t; z_t)

Stop after one pass over the data
Statistically, often competitive with batch (that is, ‖θ̄_n − θ*‖_2 ≈ ‖θ̂_n − θ*‖_2)
Precise rates depend on the problem structure
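A one-pass SGD sketch for the least-squares loss ℓ(θ; (x, y)) = ½(y − θᵀx)²; the step-size schedule is an illustrative choice, not the talk's:

```python
import numpy as np

def sgd_one_pass(data, d, step0=0.1):
    # theta_{t+1} = theta_t - alpha_t * grad ell(theta_t; z_t),
    # a single pass over the data stream
    theta = np.zeros(d)
    for t, (x, y) in enumerate(data, start=1):
        alpha_t = step0 / np.sqrt(t)        # decaying step size
        grad = (theta @ x - y) * x          # gradient of 0.5*(y - theta@x)^2
        theta -= alpha_t * grad
    return theta
```

Total cost is O(nd) for the whole pass, matching the cost of a single batch gradient iteration.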
Structural assumptions

θ* is s-sparse
Make additional structural assumptions on L(θ) = E_P[ℓ(θ; z)]:
    L is locally Lipschitz
    L is locally strongly convex (LSC)
Locally Lipschitz functions

Definition (Locally Lipschitz function)
L is locally G-Lipschitz in ℓ_1-norm, meaning that
    |L(θ) − L(θ̃)| ≤ G ‖θ − θ̃‖_1
whenever ‖θ − θ*‖_1 ≤ R and ‖θ̃ − θ*‖_1 ≤ R.

[Figure: globally Lipschitz vs. locally Lipschitz functions]
Locally strongly convex functions

Definition (Locally strongly convex function)
There is a constant γ > 0 such that
    L(θ̃) ≥ L(θ) + ⟨∇L(θ), θ̃ − θ⟩ + (γ/2) ‖θ − θ̃‖_2^2
whenever ‖θ‖_1 ≤ R and ‖θ̃‖_1 ≤ R.

[Figure: locally strongly convex vs. globally strongly convex functions]
Stochastic optimization and structural conditions

Method                            Sparsity   LSC   Convergence
SGD                               ✗          ✓     O(d / T)
Mirror descent/RDA/FOBOS/COMID    ✓          ✗     O(√(s^2 log d / T))
Our method                        ✓          ✓     O(s log d / T)
Some previous methods

All methods are based on observing g_t such that E[g_t] ∈ ∂L(θ_t)

Stochastic gradient descent: based on ℓ_2 distances, exploits LSC
    θ_{t+1} = arg min_θ ⟨g_t, θ⟩ + (1/(2α_t)) ‖θ − θ_t‖_2^2

Stochastic dual averaging: based on ℓ_p distances, exploits sparsity when p ≈ 1
    θ_{t+1} = arg min_θ Σ_{s=1}^t ⟨g_s, θ⟩ + (1/(2α_t)) ‖θ‖_p^2

Need to reconcile the two geometries to exploit both structures
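The dual-averaging step has a closed form: the unconstrained minimizer is the gradient of the conjugate of ½‖·‖_p², and an ℓ_p-ball constraint only rescales the result radially (exact for this objective, since the minimizing direction is independent of the radius). A sketch with helper names of my own choosing:

```python
import numpy as np

def dual_link(mu, q):
    # Direction minimizing <mu, theta> + 0.5*||theta||_p^2, where 1/p + 1/q = 1:
    # the (negated) gradient of the conjugate 0.5*||.||_q^2 at mu.
    nq = np.linalg.norm(mu, ord=q)
    if nq == 0.0:
        return np.zeros_like(mu)
    return -np.sign(mu) * np.abs(mu) ** (q - 1) * nq ** (2 - q)

def da_step(mu, alpha, p, R):
    # arg min_{||theta||_p <= R}  <mu, theta> + ||theta||_p^2 / (2*alpha)
    q = p / (p - 1.0)
    theta = alpha * dual_link(mu, q)
    norm_p = np.linalg.norm(theta, ord=p)
    if norm_p > R:
        theta *= R / norm_p      # radial clipping back onto the l_p ball
    return theta
```

For p = 2 this reduces to the familiar projected gradient form θ = −α μ, clipped to the ℓ_2 ball.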
RADAR algorithm: outline

Based on Juditsky and Nesterov (2011)
Recall the minimization problem: min_θ E[ℓ(θ; z)]
The algorithm proceeds over K epochs
At epoch i, solve the regularized problem
    min_{θ ∈ Ω_i} E[ℓ(θ; z)] + λ_i ‖θ‖_1,
where Ω_i = {θ ∈ R^d : ‖θ − y_i‖_p^2 ≤ R_i^2}
RADAR algorithm: First epoch

Require: R_1 such that ‖θ*‖_1 ≤ R_1
Perform stochastic dual averaging with p = 2 log d / (2 log d − 1) ≈ 1
Initialize θ_1 = 0, y_1 = 0
Observe g_t with E[g_t] ∈ ∂L(θ_t), and ν_t ∈ ∂‖θ_t‖_1
Updates:
    μ_{t+1} = μ_t + g_t + λ_1 ν_t
    θ_{t+1} = arg min_{‖θ‖_p ≤ R_1} ⟨θ, μ_{t+1}⟩ + (1/(2α_t)) ‖θ‖_p^2

[Figure: iterates confined to the ℓ_p ball of radius R_1 around y_1 = 0, with θ* inside]
Initializing next epoch

Update y_2 = θ̄_T (average of the first-epoch iterates)
Update R_2^2 = R_1^2 / 2
Update λ_2 = λ_1 / √2
Initialize θ_1 = y_2 for the next epoch

Now use the updates
    μ_{t+1} = μ_t + g_t + λ_2 ν_t
    θ_{t+1} = arg min_{‖θ − y_2‖_p ≤ R_2} ⟨θ, μ_{t+1}⟩ + (1/(2α_t)) ‖θ − y_2‖_p^2

Each step still costs O(d)

[Figure: smaller ball of radius R_2 centered at y_2, still containing θ*]
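Putting the epochs together, a hedged sketch of the overall loop. The step-size schedule, averaging scheme, and constants are illustrative stand-ins, not the paper's exact choices:

```python
import numpy as np

def radar(sample_grad, d, R1, lam1, epoch_len, n_epochs, step0=0.5):
    """Sketch of the RADAR epoch structure (constants illustrative).

    Each epoch runs stochastic dual averaging with p close to 1 on an
    l_p ball of radius R_i centered at y_i, then recenters at the epoch
    average, halves R_i^2, and shrinks lam_i by sqrt(2).
    """
    p = 2.0 * np.log(d) / (2.0 * np.log(d) - 1.0)   # p ~ 1
    q = p / (p - 1.0)
    y, R, lam = np.zeros(d), R1, lam1
    for _ in range(n_epochs):
        mu = np.zeros(d)
        theta = y.copy()
        avg = np.zeros(d)
        for t in range(1, epoch_len + 1):
            g = sample_grad(theta)              # E[g] in dL(theta)
            nu = np.sign(theta)                 # subgradient of ||theta||_1
            mu += g + lam * nu                  # dual-averaging accumulation
            # arg min_{||th - y||_p <= R} <mu, th> + ||th - y||_p^2/(2*a_t)
            a_t = step0 / np.sqrt(t)
            nq = np.linalg.norm(mu, ord=q)
            if nq == 0.0:
                delta = np.zeros(d)
            else:
                delta = -a_t * np.sign(mu) * np.abs(mu)**(q - 1) * nq**(2 - q)
            norm_p = np.linalg.norm(delta, ord=p)
            if norm_p > R:
                delta *= R / norm_p             # clip radially onto the ball
            theta = y + delta
            avg += (theta - avg) / t            # running average theta_bar
        y = avg                                 # recenter at the epoch average
        R /= np.sqrt(2.0)                       # R_{i+1}^2 = R_i^2 / 2
        lam /= np.sqrt(2.0)                     # lam_{i+1} = lam_i / sqrt(2)
    return y
```

Each inner step is O(d) arithmetic, so a one-pass run over n samples costs O(nd) total.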