MIT 9.520/6.860, Fall 2018
Statistical Learning Theory and Applications

Class 06: Learning with Stochastic Gradients

Sasha Rakhlin
Why Optimization?

Much (but not all) of Machine Learning: write down an objective function involving data and parameters, then find good (or optimal) parameters through optimization.

Key idea: find a near-optimal solution by iteratively using only local information about the objective (e.g. gradient, Hessian).
Motivating example: Newton's Method

Newton's method in 1d:

    w_{t+1} = w_t - (f''(w_t))^{-1} f'(w_t)

Example (parabola): f(w) = a w^2 + b w + c. Start with any w_1. Then Newton's method gives

    w_2 = w_1 - (2a)^{-1} (2a w_1 + b),

which means w_2 = -b/(2a). Finds the minimum of f in 1 step, no matter where you start!
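To make the parabola example concrete, here is a minimal Python sketch of the 1-d Newton step; the coefficients a, b, c and the starting point are arbitrary choices for illustration.

# Newton's method in 1d: w_{t+1} = w_t - f'(w_t) / f''(w_t).
# Illustrative sketch on the parabola f(w) = a w^2 + b w + c.

def newton_step_1d(w, fprime, fsecond):
    # One Newton update using first and second derivatives at w.
    return w - fprime(w) / fsecond(w)

a, b, c = 3.0, -4.0, 1.0           # any parabola with a > 0
fprime  = lambda w: 2 * a * w + b  # f'(w)
fsecond = lambda w: 2 * a          # f''(w), constant for a parabola

w1 = 10.0                          # arbitrary starting point
w2 = newton_step_1d(w1, fprime, fsecond)
print(w2, -b / (2 * a))            # both print the minimizer -b/(2a)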
Newton's method in multiple dimensions:

    w_{t+1} = w_t - [∇²f(w_t)]^{-1} ∇f(w_t)

(here ∇²f(w_t) is the Hessian, assumed invertible)
Recalling Least Squares

Least squares objective (without 1/n normalization):

    f(w) = Σ_{i=1}^n (y_i - x_i^T w)² = ‖Y - Xw‖²

Calculate: ∇²f(w) = 2 X^T X and ∇f(w) = -2 X^T (Y - Xw).

Taking w_1 = 0, Newton's method gives

    w_2 = 0 + (2 X^T X)^{-1} 2 X^T (Y - X·0) = (X^T X)^{-1} X^T Y,

which is the least-squares solution (global minimum). Again, 1 step is enough.

Verify: if f(w) = ‖Y - Xw‖² + λ‖w‖², then (X^T X) becomes (X^T X + λI).
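As a numerical sanity check, here is a sketch showing that one Newton step from w_1 = 0 recovers the closed-form least-squares solution; the synthetic data, dimensions, and noise level are illustrative assumptions.

# One Newton step on f(w) = ||Y - Xw||^2 equals the least-squares solution.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w1 = np.zeros(d)
grad    = -2 * X.T @ (Y - X @ w1)          # gradient of f at w_1
hessian =  2 * X.T @ X                     # Hessian, constant in w
w2 = w1 - np.linalg.solve(hessian, grad)   # Newton step

w_ls = np.linalg.solve(X.T @ X, X.T @ Y)   # closed-form least squares
print(np.allclose(w2, w_ls))               # True: one step suffices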
What do we do if data (x_1, y_1), ..., (x_n, y_n), ... are streaming? Can we incorporate data on the fly, without having to re-compute the inverse of (X^T X) at every step?

→ Online Learning
Recursive Least Squares

Let w_1 = 0. Let w_t be the least-squares solution after seeing t-1 data points. Can we get w_t from w_{t-1} cheaply?

Newton's method will do it in 1 step (since the objective is quadratic). Let C_t = Σ_{i=1}^t x_i x_i^T (or + λI) and X_t = [x_1, ..., x_t]^T, Y_t = [y_1, ..., y_t]^T. Newton's method gives

    w_{t+1} = w_t + C_t^{-1} X_t^T (Y_t - X_t w_t)

This can be simplified to

    w_{t+1} = w_t + C_t^{-1} x_t (y_t - x_t^T w_t)

since residuals up to t-1 are orthogonal to the columns of X_{t-1}.

The bottleneck is computing C_t^{-1}. Can we update it quickly from C_{t-1}^{-1}?
Sherman-Morrison formula: for invertible square A and any u, v,

    (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u)

Hence

    C_t^{-1} = C_{t-1}^{-1} - (C_{t-1}^{-1} x_t x_t^T C_{t-1}^{-1}) / (1 + x_t^T C_{t-1}^{-1} x_t)

and (do the calculation)

    C_t^{-1} x_t = C_{t-1}^{-1} x_t · 1/(1 + x_t^T C_{t-1}^{-1} x_t)

Computation required: d×d matrix C_t^{-1} times a d×1 vector = O(d²) time to incorporate a new datapoint. Memory: O(d²). Unlike full regression from scratch, this does not depend on the amount of data t.
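A minimal sketch of the resulting recursive least-squares procedure, assuming a small ridge term λI so that C_0 is invertible (as the slide allows); the synthetic data and the value of λ are illustrative.

# Recursive least squares: maintain C_t^{-1} via the Sherman-Morrison
# rank-one update, O(d^2) time and memory per new point.
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 5, 200, 1e-3
w_true = rng.standard_normal(d)

C_inv = np.eye(d) / lam       # C_0 = lam * I, so C_0^{-1} = I / lam
w = np.zeros(d)
for _ in range(T):
    x = rng.standard_normal(d)
    y = x @ w_true + 0.01 * rng.standard_normal()
    Cx = C_inv @ x                                # O(d^2) matrix-vector
    C_inv -= np.outer(Cx, Cx) / (1.0 + x @ Cx)    # Sherman-Morrison update
    w = w + C_inv @ x * (y - x @ w)               # Newton / RLS step

print(np.linalg.norm(w - w_true))  # small: w tracks the LS solution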
Recursive Least Squares (cont.)

Recap: recursive least squares is

    w_{t+1} = w_t + C_t^{-1} x_t (y_t - x_t^T w_t)

with a rank-one update of C_{t-1}^{-1} to get C_t^{-1}.

Consider throwing away the second-derivative information, replacing it with a scalar:

    w_{t+1} = w_t + η_t x_t (y_t - x_t^T w_t),

where η_t is a decreasing sequence.
Online Least Squares

The algorithm

    w_{t+1} = w_t + η_t x_t (y_t - x_t^T w_t)

◮ is recursive;
◮ does not require storing the matrix C_t^{-1};
◮ does not require updating the inverse, but only vector/vector multiplication.

However, we are not guaranteed convergence in 1 step. How many? How to choose η_t?
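A sketch of this online update on streaming synthetic data; the particular decreasing schedule η_t = 0.1/√t is an illustrative choice, not one prescribed by the slides.

# Online least squares: O(d) time and memory per point, no inverse.
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 5000
w_true = rng.standard_normal(d)

w = np.zeros(d)
for t in range(1, T + 1):
    x = rng.standard_normal(d)
    y = x @ w_true + 0.01 * rng.standard_normal()
    eta = 0.1 / np.sqrt(t)             # decreasing step size (illustrative)
    w = w + eta * x * (y - x @ w)      # scalar step replaces C_t^{-1}

print(np.linalg.norm(w - w_true))      # converges, though not in one step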
First, recognize that

    -∇_w (y_t - x_t^T w)² = 2 x_t (y_t - x_t^T w).

Hence, the proposed method is gradient descent. Let us study it abstractly and then come back to least squares.
Lemma: Let f be convex and G-Lipschitz. Let w* ∈ argmin_w f(w) with ‖w*‖ ≤ B. Then gradient descent w_{t+1} = w_t - η ∇f(w_t) with η = B/(G√T) and w_1 = 0 yields a sequence of iterates such that the average w̄_T = (1/T) Σ_{t=1}^T w_t of the trajectory satisfies

    f(w̄_T) - f(w*) ≤ BG/√T.

Proof:

    ‖w_{t+1} - w*‖² = ‖w_t - η∇f(w_t) - w*‖²
                    = ‖w_t - w*‖² + η²‖∇f(w_t)‖² - 2η ∇f(w_t)^T (w_t - w*)

Rearrange:

    2η ∇f(w_t)^T (w_t - w*) = ‖w_t - w*‖² - ‖w_{t+1} - w*‖² + η²‖∇f(w_t)‖².

Note: Lipschitzness of f is equivalent to ‖∇f(w)‖ ≤ G.
Summing over t = 1, ..., T, telescoping, dropping the negative term, using w_1 = 0, and dividing both sides by 2η,

    Σ_{t=1}^T ∇f(w_t)^T (w_t - w*) ≤ (1/(2η)) ‖w*‖² + (η/2) T G² ≤ BG√T.

Convexity of f means f(w_t) - f(w*) ≤ ∇f(w_t)^T (w_t - w*), and so

    (1/T) Σ_{t=1}^T [f(w_t) - f(w*)] ≤ (1/T) Σ_{t=1}^T ∇f(w_t)^T (w_t - w*) ≤ BG/√T.

The Lemma follows by convexity of f and Jensen's inequality. (end of proof)
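A numerical sketch of the Lemma on the convex G-Lipschitz function f(w) = ‖w - w*‖_1, with G = √d and a subgradient in place of the gradient (as the remarks below allow); the choice of f and w* is illustrative. The gap for the averaged iterate stays below BG/√T.

# Check f(avg iterate) - f(w*) <= B*G/sqrt(T) for subgradient descent.
import numpy as np

d, T = 5, 10000
w_star = np.ones(d)                  # minimizer of f; illustrative choice
B = np.linalg.norm(w_star)           # bound on ||w*||
G = np.sqrt(d)                       # Lipschitz constant of the l1 norm
eta = B / (G * np.sqrt(T))           # step size from the Lemma

w = np.zeros(d)                      # w_1 = 0, as in the Lemma
avg = np.zeros(d)
for t in range(T):
    grad = np.sign(w - w_star)       # a subgradient of f at w
    w = w - eta * grad
    avg += w / T                     # running average of the iterates

f = lambda v: np.abs(v - w_star).sum()
print(f(avg) - f(w_star), B * G / np.sqrt(T))   # observed gap <= bound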
Gradient descent can be written as

    w_{t+1} = argmin_w η { f(w_t) + ∇f(w_t)^T (w - w_t) } + (1/2) ‖w - w_t‖²,

which can be interpreted as minimizing a linear approximation while staying close to the previous solution. Alternatively, one can interpret it as building a second-order model locally (since we cannot fully trust the local information, unlike our first parabola example).
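To check this equivalence, set the gradient of the minimized expression with respect to w to zero:

    η ∇f(w_t) + (w - w_t) = 0,   i.e.   w = w_t - η ∇f(w_t),

which is exactly the gradient step.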
Remarks:

◮ Gradient descent for non-smooth functions does not guarantee actual descent of the iterates w_t (only of their average).
◮ For constrained optimization problems over a set K, do the projected gradient step

    w_{t+1} = Proj_K(w_t - η ∇f(w_t)).

  The proof is essentially the same.
◮ Can take stepsize η_t = B/(G√t) to make it horizon-independent.
◮ Knowledge of G and B is not necessary (with appropriate changes).
◮ Faster convergence under additional assumptions on f (smoothness, strong convexity).
◮ Last class: for smooth functions (gradient is L-Lipschitz), constant step size 1/L gives faster O(1/T) convergence.
◮ Gradients can be replaced with stochastic gradients (unbiased estimates).
Stochastic Gradients

Suppose we only have access to an unbiased estimate ∇_t of ∇f(w_t) at step t. That is, E[∇_t | w_t] = ∇f(w_t). Then Stochastic Gradient Descent (SGD)

    w_{t+1} = w_t - η ∇_t

enjoys the guarantee

    E[f(w̄_T)] - f(w*) ≤ BG/√T,

where G is such that E[‖∇_t‖²] ≤ G² for all t.

Kind of amazing: at each step we go in a direction that is wrong (but correct on average) and still converge.
Stochastic Gradients

Setting #1: Empirical loss can be written as

    f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, w^T x_i) = E_{I ~ unif[1:n]} ℓ(y_I, w^T x_I)

Then ∇_t = ∇ℓ(y_I, w_t^T x_I) is an unbiased gradient:

    E[∇_t | w_t] = E[∇ℓ(y_I, w_t^T x_I) | w_t] = ∇ E[ℓ(y_I, w_t^T x_I) | w_t] = ∇f(w_t)

Conclusion: if we pick an index I uniformly at random from the dataset and make a gradient step along ∇ℓ(y_I, w_t^T x_I), then we are performing SGD on the empirical loss objective.
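A sketch of Setting #1 with the square loss ℓ(y, ŷ) = (y - ŷ)²; the dataset, step-size schedule, and iteration count are illustrative assumptions.

# SGD on the empirical loss: sample an index uniformly at each step.
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 500, 5, 20000
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
for t in range(1, T + 1):
    i = rng.integers(n)                       # I ~ unif[1:n]
    grad = -2 * X[i] * (y[i] - X[i] @ w)      # unbiased: E[grad | w] = grad f(w)
    w = w - 0.01 / np.sqrt(t) * grad          # decreasing step (illustrative)

print(np.mean((y - X @ w) ** 2))              # empirical loss is small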
Stochastic Gradients

Setting #2: Expected loss can be written as

    f(w) = E ℓ(Y, w^T X)

where (X, Y) is drawn i.i.d. from the population P_{X×Y}. Then ∇_t = ∇ℓ(Y, w_t^T X) is an unbiased gradient:

    E[∇_t | w_t] = E[∇ℓ(Y, w_t^T X) | w_t] = ∇ E[ℓ(Y, w_t^T X) | w_t] = ∇f(w_t)

Conclusion: if we pick an example (X, Y) from the distribution P_{X×Y} and make a gradient step along ∇ℓ(Y, w_t^T X), then we are performing SGD on the expected loss objective. Equivalent to going through a dataset once.
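A matching sketch of Setting #2: each step draws a fresh (X, Y) from the population (simulated here by a sampler) and uses it exactly once; the sampler and constants are illustrative.

# Single-pass SGD on the expected loss E[(Y - w^T X)^2].
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 20000
w_true = rng.standard_normal(d)

def sample():                       # stands in for one draw from P_{X x Y}
    X = rng.standard_normal(d)
    return X, X @ w_true + 0.1 * rng.standard_normal()

w = np.zeros(d)
for t in range(1, T + 1):
    X, Y = sample()                 # each example is used exactly once
    w -= 0.01 / np.sqrt(t) * (-2 * X * (Y - X @ w))

print(np.linalg.norm(w - w_true))   # approaches the expected-loss minimizer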
Stochastic Gradients

Say we are in Setting #2 and we go through the dataset once. The guarantee is

    E[f(w̄_T)] - f(w*) ≤ BG/√T

after T iterations. So, the time complexity to find an ε-minimizer of the expected objective E ℓ(Y, w^T X) is independent of the dataset size n!! Suitable for large-scale problems.
Stochastic Gradients

In practice, we cycle through the dataset several times (which is somewhere between Setting #1 and #2).
Appendix

A function f : R^d → R is convex if

    f(αu + (1-α)v) ≤ α f(u) + (1-α) f(v)

for any α ∈ [0, 1] and u, v ∈ R^d (or restricted to a convex set).

For a differentiable function, convexity is equivalent to monotonicity of the gradient:

    ⟨∇f(u) - ∇f(v), u - v⟩ ≥ 0,                                   (1)

where

    ∇f(u) = (∂f(u)/∂u_1, ..., ∂f(u)/∂u_d).
Appendix

It holds that for a convex differentiable function,

    f(u) ≥ f(v) + ⟨∇f(v), u - v⟩.                                 (2)

The subdifferential set is defined (for a given v) precisely as the set of all vectors ∇ such that

    f(u) ≥ f(v) + ⟨∇, u - v⟩                                      (3)

for all u. The subdifferential set is denoted by ∂f(v). A subgradient will often substitute for the gradient, even if we don't specify it.