Accelerated Stochastic Subgradient Methods under Local Error Bound Condition

Yi Xu (yi-xu@uiowa.edu)
Computer Science Department, The University of Iowa
Co-authors: Tianbao Yang, Qihang Lin

VALSE Webinar Presentation, April 18, 2018
Outline

1. Introduction
2. Accelerated Stochastic Subgradient Methods
3. Applications and experiments
4. Conclusion
Introduction
Example in machine learning

Table: house price

house    size (sqf)    price ($1k)
1        68            500
2        220           800
...      ...           ...
19       359           1500
20       266           820

[Figure: scatter plot of the data, price y versus house size x]

Linear model: y = f(x) = xw, where y = price and x = size.
[Figure: data points (x_i, y_i) and the fitted values (x_i, f(x_i)); each vertical gap contributes the squared error |y_i − f(x_i)|^2]

Total squared error:
  |y_1 − x_1 w|^2 + |y_2 − x_2 w|^2 + … + |y_20 − x_20 w|^2
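To make the fitting step concrete, here is a minimal Python sketch using only the four table rows shown above; because the model has a single parameter w, minimizing the total squared error has a closed-form solution.

```python
import numpy as np

# The four visible rows of the house-price table: x = size (sqf), y = price ($1k).
x = np.array([68.0, 220.0, 359.0, 266.0])
y = np.array([500.0, 800.0, 1500.0, 820.0])

# Minimizing sum_i |y_i - x_i w|^2 over a single parameter w gives the
# closed-form solution w* = (sum_i x_i y_i) / (sum_i x_i^2).
w = np.dot(x, y) / np.dot(x, x)

total_squared_error = np.sum((y - x * w) ** 2)
print(f"fitted slope w = {w:.4f}")
print(f"total squared error = {total_squared_error:.2f}")
```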
Least squares regression (smooth):

  min_{w ∈ R} F(w) = (1/n) Σ_{i=1}^n (y_i − x_i w)^2        [square loss]

[Figure: square loss as a function of the residual y − xw]

Least absolute deviations (non-smooth):

  min_{w ∈ R} F(w) = (1/n) Σ_{i=1}^n |y_i − x_i w|          [absolute loss]

[Figure: absolute loss as a function of the residual y − xw]

High-dimensional model:

  min_{w ∈ R^d} F(w) = (1/n) Σ_{i=1}^n |y_i − x_i^⊤ w| + λ‖w‖_1 = (1/n)‖Xw − y‖_1 + λ‖w‖_1   [absolute loss + regularizer]

- the absolute loss is more robust to the outlier problem (see the sketch below)
- the ℓ_1-norm regularizer is used for feature selection
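A minimal sketch (with made-up toy numbers, not data from this talk) of the robustness claim: with one gross outlier, the square-loss fit is pulled away from the trend while the absolute-loss fit stays near it.

```python
import numpy as np

def square_objective(w, x, y):
    """F(w) = (1/n) * sum_i (y_i - x_i * w)^2  (smooth)."""
    return np.mean((y - x * w) ** 2)

def absolute_objective(w, x, y):
    """F(w) = (1/n) * sum_i |y_i - x_i * w|  (non-smooth)."""
    return np.mean(np.abs(y - x * w))

# Toy 1-D data following the trend y = 2x, with one gross outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 50.0])   # last point is an outlier

# Minimize each objective over a grid of candidate slopes.
grid = np.linspace(0.0, 10.0, 1001)
w_sq = grid[np.argmin([square_objective(w, x, y) for w in grid])]
w_abs = grid[np.argmin([absolute_objective(w, x, y) for w in grid])]

# The absolute loss keeps the fit near the trend w = 2,
# while the square loss is pulled toward the outlier.
print(f"square-loss minimizer   ≈ {w_sq:.2f}")
print(f"absolute-loss minimizer ≈ {w_abs:.2f}")
```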
Machine learning problems:

  min_{w ∈ R^d} F(w) = (1/n) Σ_{i=1}^n ℓ(w; x_i, y_i) + r(w)     [loss function + regularizer]

Classification:
- hinge loss: ℓ(w; x, y) = max(0, 1 − y x^⊤ w)

Regression:
- absolute loss: ℓ(w; x, y) = |x^⊤ w − y|
- square loss: ℓ(w; x, y) = (x^⊤ w − y)^2

Regularizer:
- ℓ_1 norm: r(w) = λ‖w‖_1
- ℓ_2^2 norm: r(w) = (λ/2)‖w‖_2^2
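As a reference, here is a minimal Python sketch (an illustration, not code from this talk) of the losses and regularizers listed above, written for a single example (x, y) and combined into the general objective F(w):

```python
import numpy as np

def hinge_loss(w, x, y):
    """Classification, y in {-1, +1}: max(0, 1 - y * x^T w)."""
    return max(0.0, 1.0 - y * np.dot(x, w))

def absolute_loss(w, x, y):
    """Regression: |x^T w - y|."""
    return abs(np.dot(x, w) - y)

def square_loss(w, x, y):
    """Regression: (x^T w - y)^2."""
    return (np.dot(x, w) - y) ** 2

def l1_regularizer(w, lam):
    """r(w) = lam * ||w||_1."""
    return lam * np.sum(np.abs(w))

def l2_regularizer(w, lam):
    """r(w) = (lam / 2) * ||w||_2^2."""
    return 0.5 * lam * np.dot(w, w)

def objective(w, X, Y, loss, reg, lam):
    """F(w) = (1/n) * sum_i loss(w; x_i, y_i) + r(w)."""
    n = len(Y)
    return sum(loss(w, X[i], Y[i]) for i in range(n)) / n + reg(w, lam)
```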
Convex optimization problem

Problem: min_{w ∈ R^d} F(w)
- F(w): R^d → R is convex
- optimal value: F(w_*) = min_{w ∈ R^d} F(w)
- optimal solution: w_*

Goal: find a solution ŵ such that
  F(ŵ) − F(w_*) ≤ ε,   where 0 < ε ≪ 1 (e.g. 10^{-7})
- such a ŵ is called an ε-optimal solution
Complexity measure

Most optimization algorithms are iterative:
  w_{t+1} = w_t + Δw_t

Iteration complexity: the number of iterations T(ε) needed so that
  F(w_T) − F(w_*) ≤ ε,   where 0 < ε ≪ 1.

[Figure: objective value versus iterations; the optimality gap falls below ε after T iterations]

Time complexity: T(ε) × C(n, d), where C(n, d) is the per-iteration cost.
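A minimal sketch of how iteration complexity is read off in practice, on a made-up toy objective F(w) = 0.5 w^2 (the step size 0.1 is an arbitrary illustrative choice): count the iterations until the optimality gap falls below ε.

```python
def F(w):
    # toy smooth objective with minimizer w_* = 0 and F(w_*) = 0
    return 0.5 * w ** 2

def grad_F(w):
    return w

eps = 1e-7
eta = 0.1            # illustrative step size
w, F_star = 5.0, 0.0

t = 0
while F(w) - F_star > eps:
    w = w - eta * grad_F(w)   # one update step
    t += 1

print(f"T(eps) = {t} iterations to reach an {eps}-optimal solution")
# Total running time then scales as T(eps) * C(n, d), the per-iteration cost.
```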
Gradient Descent (GD)

Problem: min_{w ∈ R} F(w), with F smooth:
  F(w) ≤ F(w_t) + ⟨∇F(w_t), w − w_t⟩ + (L/2)‖w − w_t‖_2^2

Each step minimizes this quadratic upper bound:
  w_{t+1} = arg min_{w ∈ R} F(w_t) + ⟨∇F(w_t), w − w_t⟩ + (L/2)‖w − w_t‖_2^2

GD: initial w_0 ∈ R; for t = 0, 1, ...
  w_{t+1} = w_t − η ∇F(w_t)
where η = 1/L > 0 is the step size.
- simple & easy to implement

[Figure: quadratic upper bound at the starting point (w_0, F(w_0)) with ∇F(w_0) > 0; the iterates move toward the minimizer w_*]

Theorem ([Nesterov, 2004])
After T = O(1/ε) iterations, F(w_T) − F(w_*) ≤ ε.
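A minimal GD sketch for the smooth least-squares objective on synthetic data (the data, dimensions, and iteration count are illustrative choices, not from this talk); the step size is set to η = 1/L, with L the smoothness constant of this particular objective.

```python
import numpy as np

# Toy data for F(w) = (1/n) * ||X w - y||_2^2
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad(w):
    # gradient of (1/n) * ||X w - y||_2^2
    return 2.0 / n * X.T @ (X @ w - y)

L = 2.0 * np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness constant
eta = 1.0 / L                                     # step size eta = 1/L

w = np.zeros(d)
for t in range(500):
    w = w - eta * grad(w)                         # GD update

print("final objective:", np.mean((X @ w - y) ** 2))
```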
Accelerated Gradient Descent (AGD)

Nesterov's momentum trick.

AGD: initial w_0, v_1 = w_0; for t = 1, 2, ...
  w_t = v_t − η ∇F(v_t)                  (gradient step)
  v_{t+1} = w_t + β_t (w_t − w_{t−1})    (momentum step)
where β_t ∈ (0, 1) is the momentum parameter.

[Figure: gradient step versus momentum step in Nesterov's accelerated gradient]

Theorem ([Beck and Teboulle, 2009])
Let η = 1/L, θ_1 = 1, θ_{t+1} = (1 + √(1 + 4θ_t^2))/2, and β_t = (θ_t − 1)/θ_{t+1}. Then after T = O(1/√ε) iterations, F(w_T) − F(w_*) ≤ ε.
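A minimal AGD sketch with the [Beck and Teboulle, 2009] parameter schedule quoted above, again on synthetic least-squares data (data and iteration count are illustrative choices):

```python
import numpy as np

# Toy data for the smooth objective F(w) = (1/n) * ||X w - y||_2^2
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad(w):
    return 2.0 / n * X.T @ (X @ w - y)

L = 2.0 * np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness constant
eta = 1.0 / L                                     # eta = 1/L

w_prev = np.zeros(d)        # w_0
v = w_prev.copy()           # v_1 = w_0
theta = 1.0                 # theta_1 = 1

for t in range(500):
    w = v - eta * grad(v)                                 # gradient step
    theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta**2)) / 2.0
    beta = (theta - 1.0) / theta_next                     # momentum parameter
    v = w + beta * (w - w_prev)                           # momentum step
    w_prev, theta = w, theta_next

print("final objective:", np.mean((X @ w - y) ** 2))
```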
SubGradient (SG) descent

Problem: min_{w ∈ R} F(w), with F non-smooth.

SG: initial w_0; for t = 0, 1, ...
  w_{t+1} = w_t − η_t ∂F(w_t)
where ∂F(w_t) is a subgradient and the step size η_t decreases every iteration.

Theorem ([Nesterov, 2004])
After T = O(1/ε^2) iterations, F(w_T) − F(w_*) ≤ ε.
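A minimal SG sketch for the non-smooth least-absolute-deviations objective on synthetic data; the decreasing step size η_t = η_0/√(t+1) and the iterate averaging are common practical choices, not prescriptions from this talk.

```python
import numpy as np

# Toy data for the non-smooth objective F(w) = (1/n) * ||X w - y||_1
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def subgrad(w):
    # a subgradient of (1/n) * ||X w - y||_1 is (1/n) * X^T sign(X w - y)
    return X.T @ np.sign(X @ w - y) / n

eta0 = 1.0
w = np.zeros(d)
w_avg = np.zeros(d)           # averaged iterate (a common choice for SG)

for t in range(2000):
    eta_t = eta0 / np.sqrt(t + 1.0)        # decreasing step size
    w = w - eta_t * subgrad(w)             # subgradient step
    w_avg += (w - w_avg) / (t + 1.0)       # running average of iterates

print("final objective:", np.mean(np.abs(X @ w_avg - y)))
```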
Summary of time complexity

  min_{w ∈ R^d} F(w) = (1/n) Σ_{i=1}^n f_i(w; x_i, y_i)

Method    Smooth    Time complexity
GD        YES       O(nd/ε)
AGD       YES       O(nd/√ε)
SG        NO        O(nd/ε^2)

GD: Gradient Descent; AGD: Accelerated Gradient Descent; SG: SubGradient descent
Challenge of deterministic methods

Computing the gradient is expensive:

  min_{w ∈ R^d} F(w) := (1/n) Σ_{i=1}^n f_i(w; x_i, y_i)

  ∇F(w) := (1/n) Σ_{i=1}^n ∇f_i(w; x_i, y_i)

When n and/or d is large (Big Data):
- computing the gradient requires a pass through all data points
- every update step needs this expensive computation
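To make the contrast concrete, here is a minimal sketch of the stochastic idea motivated above: each update uses the subgradient of a single randomly sampled f_i, so the per-iteration cost is O(d) instead of O(nd). This is plain stochastic subgradient descent with illustrative choices, not the accelerated method of this talk.

```python
import numpy as np

# Toy data with large n, so a full gradient pass is relatively costly.
rng = np.random.default_rng(0)
n, d = 100_000, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

w = np.zeros(d)
for t in range(5000):
    i = rng.integers(n)                          # sample one data point
    g = X[i] * np.sign(X[i] @ w - y[i])          # subgradient of |x_i^T w - y_i|
    w = w - (1.0 / np.sqrt(t + 1.0)) * g         # decreasing step size

print("objective:", np.mean(np.abs(X @ w - y)))
```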