Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions
Alekh Agarwal, Microsoft Research
Joint work with Sahand Negahban and Martin Wainwright
Workshop on Optimization and Statistical Learning 2013, Les Houches, France
Introduction
Sparse optimization: θ* = arg min_{θ ∈ R^d} E_P[ℓ(θ; z)] = arg min_θ L(θ), such that θ* is s-sparse
Loss function ℓ is convex
P unknown, but can sample from it
High-dimensional setup: n ≪ d
Want a linear-time and statistically (near-)optimal algorithm
Example 1: Computational genomics
[Figure: y = sign(Xθ*), with X the n × d matrix of genomes and y the n × 1 vector of disease labels]
Predict disease susceptibility from genome
Depends on very few genes, so θ* is sparse
Sparse logistic regression: θ* = arg min_θ E_P[log(1 + exp(−y θ^T x))]
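A minimal NumPy sketch of the sample version of this logistic loss may make the setup concrete; the synthetic data, sizes, and function names here are illustrative assumptions, not from the talk:

```python
import numpy as np

def logistic_loss(theta, X, y):
    """Sample version of E_P[log(1 + exp(-y <theta, x>))]."""
    margins = y * (X @ theta)
    return np.mean(np.log1p(np.exp(-margins)))

rng = np.random.default_rng(0)
n, d, s = 200, 50, 3
theta_star = np.zeros(d)
theta_star[:s] = 1.0                       # disease depends on s genes only
X = rng.standard_normal((n, d))            # encoded genotypes (illustrative)
y = np.sign(X @ theta_star)                # noiseless labels for simplicity
# At theta = 0 every margin is zero, so the loss is exactly log 2;
# the sparse true parameter does strictly better.
print(logistic_loss(np.zeros(d), X, y), logistic_loss(theta_star, X, y))
```

With noiseless labels the margins at θ* are all nonnegative, so its loss sits strictly below the log 2 achieved by the zero vector.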
Example 2: Compressed sensing
[Figure: y = Xθ* + w, with X an n × d measurement matrix and w an n × 1 noise vector]
Recover unknown signal θ* from noisy measurements
Sparse linear regression: θ* = arg min_θ E_P[(y − θ^T x)^2]
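For intuition, here is a small synthetic instance of this measurement model; the dimensions, noise level, and seed are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, s = 100, 400, 5                      # high-dimensional: n << d
theta_star = np.zeros(d)
theta_star[rng.choice(d, size=s, replace=False)] = 1.0   # s-sparse signal
X = rng.standard_normal((n, d))            # random measurement matrix
w = 0.1 * rng.standard_normal(n)           # measurement noise
y = X @ theta_star + w                     # n noisy linear measurements

def sq_loss(theta):
    """Sample version of E_P[(y - <theta, x>)^2]."""
    return np.mean((y - X @ theta) ** 2)

# The true signal leaves only the noise as residual, so its loss is tiny
# compared with, say, the zero vector.
print(sq_loss(theta_star), sq_loss(np.zeros(d)))
```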
Approach 1: M-estimation (batch optimization)
Draw n i.i.d. samples
Obtain θ̂_n = arg min_θ (1/n) Σ_{i=1}^n ℓ(θ; z_i) + λ_n ‖θ‖_1
Statistical arguments for consistency, θ̂_n → θ*
Convex optimization to compute θ̂_n
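Computing the batch estimator is itself a convex program; a minimal proximal-gradient (ISTA) sketch for the least-squares loss follows. The step size, the choice of λ_n, and the synthetic data are illustrative guesses, not the talk's settings:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, n_iters=1000):
    """Minimize (1/n)||y - X theta||_2^2 + lam ||theta||_1 by proximal gradient."""
    n, d = X.shape
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1/L for the smooth part
    theta = np.zeros(d)
    for _ in range(n_iters):
        grad = -(2.0 / n) * X.T @ (y - X @ theta)  # gradient of the smooth part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(2)
n, d, s = 200, 500, 5
theta_star = np.zeros(d); theta_star[:s] = 1.0
X = rng.standard_normal((n, d))
y = X @ theta_star + 0.1 * rng.standard_normal(n)
lam = 0.1 * np.sqrt(np.log(d) / n)         # lambda_n ~ sqrt(log d / n)
theta_hat = ista(X, y, lam)
print(np.linalg.norm(theta_hat - theta_star))
```

Each ISTA iteration costs a full pass over the data, which is exactly the O(nd)-per-iteration price discussed on the later complexity slide.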
Batch optimization
Convergence depends on properties of (1/n) Σ_{i=1}^n ℓ(θ; z_i) + λ_n ‖θ‖_1
Sample loss not (globally) strongly convex for n < d
Poor smoothness when n ≪ d
[Figure: surface plot of a sample loss that is flat along most directions]
But smooth and strongly convex in sparse directions
Example: least-squares loss with random design
Fast convergence of gradient descent
We prove (global) linear convergence of gradient descent based on the sparse condition number of (1/n) Σ_{i=1}^n ℓ(θ; z_i)
[Figure: log ‖θ^t − θ̂‖ (rescaled) vs. iteration count for n = 2500 and p ∈ {5000, 10000, 20000} (one panel with α = 16.3069); the error decays linearly on the log scale]
Computational complexity of batch optimization
Convergence rate captures number of iterations
Each iteration has complexity O(nd): one pass over the data per iteration
But we wanted a linear-time algorithm!
Approach 2: Stochastic optimization
Directly minimize E_P[ℓ(θ; z)]
Use samples to obtain gradient estimates: θ^{t+1} = θ^t − α_t ∇ℓ(θ^t; z_t)
Stop after one pass over data
Statistically, often competitive with batch (that is, ‖θ^n − θ*‖_2 ≈ ‖θ̂_n − θ*‖_2)
Precise rates depend on the problem structure
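The one-pass update above can be sketched on a streaming least-squares problem; the step-size schedule and the synthetic stream are illustrative choices, not the talk's:

```python
import numpy as np

def sgd_one_pass(stream, d, step_fn):
    """theta_{t+1} = theta_t - alpha_t * grad l(theta_t; z_t), one pass only."""
    theta = np.zeros(d)
    for t, (x, y_obs) in enumerate(stream, start=1):
        grad = 2.0 * (theta @ x - y_obs) * x   # gradient of (y - <theta, x>)^2
        theta = theta - step_fn(t) * grad
    return theta

rng = np.random.default_rng(3)
d = 20
theta_star = rng.standard_normal(d)
# A generator: each sample is seen once and discarded, as in the talk's setup.
stream = ((x, x @ theta_star + 0.1 * rng.standard_normal())
          for x in rng.standard_normal((5000, d)))
theta = sgd_one_pass(stream, d, step_fn=lambda t: 1.0 / (100.0 + t))
print(np.linalg.norm(theta - theta_star))
```

Per-step cost is O(d) regardless of how many samples the stream holds, which is the linear-time property the batch approach lacked.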
Structural assumptions
θ* is s-sparse
Make additional structural assumptions on L(θ) = E_P[ℓ(θ; z)]:
L is locally Lipschitz
L is locally strongly convex (LSC)
Locally Lipschitz functions
Definition (Locally Lipschitz function): L is locally G-Lipschitz in ℓ_1-norm, meaning that |L(θ) − L(θ̃)| ≤ G ‖θ − θ̃‖_1 whenever ‖θ − θ*‖_1 ≤ R and ‖θ̃ − θ*‖_1 ≤ R.
[Figure: a globally Lipschitz function vs. a locally Lipschitz one]
Locally strongly convex functions
Definition (Locally strongly convex function): there is a constant γ > 0 such that L(θ̃) ≥ L(θ) + ⟨∇L(θ), θ̃ − θ⟩ + (γ/2) ‖θ̃ − θ‖_2^2 whenever ‖θ‖_1 ≤ R and ‖θ̃‖_1 ≤ R.
[Figure: a locally strongly convex function vs. a globally strongly convex one]
Stochastic optimization and structural conditions

Method                           | Exploits sparsity | Exploits LSC | Convergence
SGD                              | no                | yes          | O(d / T)
Mirror descent/RDA/FOBOS/COMID   | yes               | no           | O(√(s² log d / T))
Our method                       | yes               | yes          | O(s log d / T)
Some previous methods
All methods based on observing g_t such that E[g_t] ∈ ∂L(θ^t)
Stochastic gradient descent: based on ℓ_2 distances, exploits LSC
    θ^{t+1} = arg min_θ ⟨g_t, θ⟩ + (1/(2α_t)) ‖θ − θ^t‖_2^2
Stochastic dual averaging: based on ℓ_p distances, exploits sparsity when p ≈ 1
    θ^{t+1} = arg min_θ Σ_{s=1}^t ⟨g_s, θ⟩ + (1/(2α_t)) ‖θ‖_p^2
Need to reconcile the two geometries to exploit both structures
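The ℓ_p minimization in the dual-averaging update has a closed form via the dual exponent q = p/(p − 1); the sketch below is my own derivation of that link function, not code from the talk. The ball-constrained minimizer is the unconstrained one rescaled onto the ball when it lands outside:

```python
import numpy as np

def dual_averaging_step(mu_sum, alpha, p, R):
    """argmin_{||theta||_p <= R} <mu_sum, theta> + ||theta||_p^2 / (2 * alpha)."""
    q = p / (p - 1)                        # dual exponent, 1/p + 1/q = 1
    norm_q = np.linalg.norm(mu_sum, q)
    if norm_q == 0.0:
        return np.zeros_like(mu_sum)
    # Unit l_p-norm direction minimizing <mu_sum, u>.
    u = -np.sign(mu_sum) * (np.abs(mu_sum) / norm_q) ** (q - 1)
    r = min(alpha * norm_q, R)             # optimal radius, capped at the ball
    return r * u

d = 1000
p = 2 * np.log(d) / (2 * np.log(d) - 1)    # p close to 1 favors sparsity
mu = np.zeros(d); mu[0] = 1.0              # one dominant dual coordinate
theta = dual_averaging_step(mu, alpha=0.5, p=p, R=1.0)
print(theta[0])                            # -0.5: all mass on coordinate 0
```

With p this close to 1, the exponent q − 1 is large, so coordinates of μ below the maximum are suppressed almost to zero; this is how the ℓ_p geometry encourages sparse iterates.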
RADAR algorithm: outline
Based on Juditsky and Nesterov (2011)
Recall the minimization problem: min_θ E[ℓ(θ; z)]
Algorithm proceeds over K epochs
At epoch i, solve the regularized problem
    min_{θ ∈ Ω_i} E[ℓ(θ; z)] + λ_i ‖θ‖_1, where Ω_i = {θ ∈ R^d : ‖θ − y_i‖_p^2 ≤ R_i^2}
RADAR algorithm: First epoch
Require: R_1 such that ‖θ*‖_1 ≤ R_1
Perform stochastic dual averaging with p = 2 log d / (2 log d − 1) ≈ 1
Initialize θ^1 = 0, y_1 = 0
Observe g_t where E[g_t] ∈ ∂L(θ^t) and ν_t ∈ ∂‖θ^t‖_1
Update
    μ_{t+1} = μ_t + g_t + λ_1 ν_t
    θ^{t+1} = arg min_{‖θ‖_p ≤ R_1} ⟨θ, μ_{t+1}⟩ + (1/(2α_t)) ‖θ‖_p^2
[Figure: iterates constrained to the ℓ_p-ball of radius R_1 around y_1 = 0, which contains θ*]
Initializing next epoch
Update y_2 = θ̄_T (average of the first-epoch iterates)
Update R_2^2 = R_1^2 / 2
Update λ_2 = λ_1 / √2
Initialize θ^1 = y_2 for next epoch
Now use updates
    μ_{t+1} = μ_t + g_t + λ_2 ν_t
    θ^{t+1} = arg min_{‖θ − y_2‖_p ≤ R_2} ⟨θ, μ_{t+1}⟩ + (1/(2α_t)) ‖θ − y_2‖_p^2
Each step still O(d)
[Figure: a smaller ℓ_p-ball of radius R_2 centered at y_2, still containing θ*]
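Putting the epochs together, a heavily simplified end-to-end sketch of this multi-epoch scheme follows. The step sizes, epoch length, and constants are untuned guesses of mine, so this illustrates the control flow (dual averaging inside each epoch, then recenter and shrink R and λ), not the paper's tuned algorithm or its guarantees:

```python
import numpy as np

def lp_prox_step(mu, alpha, p, R, center):
    """argmin_{||theta - center||_p <= R} <mu, theta> + ||theta - center||_p^2 / (2*alpha)."""
    q = p / (p - 1)                            # dual exponent, 1/p + 1/q = 1
    norm_q = np.linalg.norm(mu, q)
    if norm_q == 0.0:
        return center.copy()
    u = -np.sign(mu) * (np.abs(mu) / norm_q) ** (q - 1)
    return center + min(alpha * norm_q, R) * u

def radar(sample_grad, d, R1, lam1, epochs=8, T=1500, alpha=0.1):
    """Multi-epoch dual averaging: recenter, halve R^2, shrink lambda by sqrt(2)."""
    p = 2 * np.log(d) / (2 * np.log(d) - 1)
    y, R, lam = np.zeros(d), R1, lam1
    for _ in range(epochs):
        theta, mu, theta_sum = y.copy(), np.zeros(d), np.zeros(d)
        for t in range(1, T + 1):
            g = sample_grad(theta)             # stochastic gradient of L
            nu = np.sign(theta)                # subgradient of ||theta||_1
            mu += g + lam * nu
            theta = lp_prox_step(mu, alpha / np.sqrt(t), p, R, y)
            theta_sum += theta
        y = theta_sum / T                      # y_{i+1} = epoch average
        R, lam = R / np.sqrt(2), lam / np.sqrt(2)
    return y

# Illustrative sparse linear regression stream (all constants are guesses).
rng = np.random.default_rng(4)
d, s = 100, 3
theta_star = np.zeros(d); theta_star[:s] = 1.0

def sample_grad(theta):
    x = rng.standard_normal(d)
    y_obs = x @ theta_star + 0.1 * rng.standard_normal()
    return 2.0 * (theta @ x - y_obs) * x

theta_hat = radar(sample_grad, d, R1=2.0 * s, lam1=0.5)
print(np.linalg.norm(theta_hat - theta_star))
```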