High Order Methods for Empirical Risk Minimization

Alejandro Ribeiro
Department of Electrical and Systems Engineering
University of Pennsylvania
aribeiro@seas.upenn.edu

Thanks to: Aryan Mokhtari, Mark Eisen, ONR, NSF

DIMACS Workshop on Distributed Optimization, Information Processing, and Learning
August 21, 2017
Introduction

◮ Introduction
◮ Incremental quasi-Newton algorithms
◮ Adaptive sample size algorithms
◮ Conclusions
Large-scale empirical risk minimization

◮ We would like to solve the statistical risk minimization problem ⇒ min_{w ∈ R^p} E_θ[ f(w, θ) ]
◮ The distribution is unknown, but we have access to N independent realizations of θ
◮ We settle for solving the empirical risk minimization (ERM) problem

      min_{w ∈ R^p} F(w) := min_{w ∈ R^p} (1/N) Σ_{i=1}^N f(w, θ_i) = min_{w ∈ R^p} (1/N) Σ_{i=1}^N f_i(w)

◮ The number of observations N is very large. The dimension p is large as well
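A minimal sketch of this setup in code, assuming a least-squares loss and synthetic data purely for illustration (the talk treats a generic loss f(w, θ); the problem sizes below are placeholder choices):

```python
import numpy as np

# Illustrative ERM instance: N samples theta_i = (x_i, y_i) with a
# least-squares loss. The loss, data, and sizes are assumptions for the example.
N, p = 10000, 50
rng = np.random.default_rng(0)
X = rng.standard_normal((N, p))                                  # features x_i
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(N)    # targets y_i

def f_i(w, i):
    """Per-sample loss f_i(w) = f(w, theta_i) = 0.5 * (x_i^T w - y_i)^2."""
    return 0.5 * (X[i] @ w - y[i]) ** 2

def F(w):
    """Empirical risk F(w) = (1/N) sum_i f_i(w)."""
    return np.mean([f_i(w, i) for i in range(N)])
```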
Distribute across time and space

◮ Handle a large number of observations by distributing samples across space and time ⇒ Thus, we want to do decentralized online optimization

[Figure: samples θ_i^t arrive sequentially at nodes 1, 2, 3, 4; each node draws samples from its local function f_i, and the nodes collectively minimize the aggregate objective F]

◮ We'd like to design scalable decentralized online optimization algorithms
◮ We have scalable decentralized methods. We don't have scalable online methods
Optimization methods

◮ Stochastic methods: a subset of samples is used at each iteration
◮ SGD is the most popular; however, it is slow because of
   ⇒ Noise of stochasticity ⇒ Variance reduction (SAG, SAGA, SVRG, ...)
   ⇒ Poor curvature approximation ⇒ Stochastic QN (SGD-QN, RES, oLBFGS, ...)
◮ Decentralized methods: samples are distributed over multiple processors
   ⇒ Primal methods: DGD, Acc. DGD, NN, ...
   ⇒ Dual methods: DDA, DADMM, DQM, EXTRA, ESOM, ...
◮ Adaptive sample size methods: start with a subset of samples and increase the size of the training set at each iteration
   ⇒ Ada Newton
   ⇒ The solutions are close when the numbers of samples are close
Incremental quasi-Newton algorithms

◮ Introduction
◮ Incremental quasi-Newton algorithms
◮ Adaptive sample size algorithms
◮ Conclusions
Incremental Gradient Descent

◮ Objective function gradient ⇒ s(w) := ∇F(w) = (1/N) Σ_{i=1}^N ∇f(w, θ_i)
◮ (Deterministic) gradient descent iteration ⇒ w_{t+1} = w_t − ε_t s(w_t)
◮ Evaluation of (deterministic) gradients is not computationally affordable
◮ Incremental/stochastic gradient ⇒ sample average in lieu of expectation

      ŝ(w, θ̃) = (1/L) Σ_{l=1}^L ∇f(w, θ_l),   θ̃ = [θ_1; ... ; θ_L]

◮ Functions are chosen cyclically or at random, with or without replacement
◮ Incremental gradient descent iteration ⇒ w_{t+1} = w_t − ε_t ŝ(w_t, θ̃_t)
◮ (Incremental) gradient descent is (very) slow. Newton is impractical
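A sketch of the incremental gradient iteration, reusing the illustrative least-squares setup above; the mini-batch size L, step size, and iteration count are assumptions made for the example:

```python
def grad_f_i(w, i):
    """Gradient of sample i's loss: nabla f_i(w) = (x_i^T w - y_i) x_i."""
    return (X[i] @ w - y[i]) * X[i]

def incremental_gradient_descent(w0, num_iters=1000, L=10, eps=1e-2):
    """w_{t+1} = w_t - eps_t * s_hat(w_t, theta_t), with s_hat averaging L sampled gradients."""
    w = w0.copy()
    for _ in range(num_iters):
        batch = rng.choice(N, size=L, replace=False)              # random selection without replacement
        s_hat = np.mean([grad_f_i(w, i) for i in batch], axis=0)  # incremental gradient
        w = w - eps * s_hat
    return w
```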
Incremental aggregated gradient method

◮ Utilize memory to reduce the variance of the stochastic gradient approximation

[Figure: a memory table storing ∇f_1^t, ..., ∇f_{i_t}^t, ..., ∇f_N^t; the entry for the selected index i_t is replaced by ∇f_{i_t}(w_{t+1}) to produce the table at time t+1]

◮ Descend along the incremental aggregated gradient ⇒ w_{t+1} = w_t − (α/N) Σ_{i=1}^N ∇f_i^t = w_t − α g^t
◮ Select update index i_t cyclically. Uniformly at random is similar
◮ Update the gradient corresponding to function f_{i_t} ⇒ ∇f_{i_t}^{t+1} = ∇f_{i_t}(w_{t+1})
◮ Sum easy to compute ⇒ g^{t+1} = g^t + (1/N)(∇f_{i_t}^{t+1} − ∇f_{i_t}^t). Converges linearly
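A sketch of this aggregated-gradient bookkeeping under the same illustrative setup. Initializing the gradient table with a full pass at the starting point, the cyclic index rule, and the step size are assumptions of the example:

```python
def incremental_aggregated_gradient(w0, num_iters=5000, alpha=1e-2):
    """Keep the most recent gradient of each f_i in memory and descend along their average g^t."""
    w = w0.copy()
    grad_table = np.array([grad_f_i(w0, i) for i in range(N)])   # stored grad f_i^t
    g = grad_table.mean(axis=0)                                  # g^t = (1/N) sum_i grad f_i^t
    for t in range(num_iters):
        i_t = t % N                                              # cyclic index selection
        w = w - alpha * g                                        # w_{t+1} = w_t - alpha g^t
        new_grad = grad_f_i(w, i_t)                              # grad f_{i_t}(w_{t+1})
        g = g + (new_grad - grad_table[i_t]) / N                 # O(p) update of the average
        grad_table[i_t] = new_grad                               # overwrite the memory entry
    return w
```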
BFGS quasi-Newton method

◮ Approximate the function's curvature with a Hessian approximation matrix B_t

      w_{t+1} = w_t − ε_t B_t^{-1} s(w_t)

◮ Make B_t close to H(w_t) := ∇²F(w_t). Broyden, DFP, BFGS
◮ Variable variation: v_t = w_{t+1} − w_t. Gradient variation: r_t = s(w_{t+1}) − s(w_t)
◮ Matrix B_{t+1} satisfies the secant condition B_{t+1} v_t = r_t. Underdetermined
◮ Resolve the indeterminacy by making B_{t+1} closest to the previous approximation B_t
◮ Using Gaussian relative entropy as the proximity condition yields the update

      B_{t+1} = B_t + (r_t r_t^T)/(v_t^T r_t) − (B_t v_t v_t^T B_t)/(v_t^T B_t v_t)

◮ Superlinear convergence ⇒ close enough to the quadratic rate of Newton
◮ BFGS requires gradients ⇒ use incremental gradients
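A sketch of the BFGS update and a full-gradient BFGS loop on the illustrative problem. The identity initialization of B, the unit step size, and the iteration count are assumptions of the example (in practice a line search sets ε_t, and a safeguard enforces v^T r > 0):

```python
def bfgs_update(B, v, r):
    """B_{t+1} = B_t + r r^T / (v^T r) - B v v^T B / (v^T B v); satisfies B_{t+1} v = r."""
    Bv = B @ v
    return B + np.outer(r, r) / (v @ r) - np.outer(Bv, Bv) / (v @ Bv)

def grad_F(w):
    """Full gradient s(w) = (1/N) sum_i grad f_i(w) for the least-squares example."""
    return X.T @ (X @ w - y) / N

def bfgs(w0, num_iters=50, eps=1.0):
    w, B = w0.copy(), np.eye(p)
    for _ in range(num_iters):
        w_next = w - eps * np.linalg.solve(B, grad_F(w))   # w_{t+1} = w_t - eps B_t^{-1} s(w_t)
        v, r = w_next - w, grad_F(w_next) - grad_F(w)      # variable and gradient variations
        B = bfgs_update(B, v, r)
        w = w_next
    return w
```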
Incremental BFGS method

◮ Keep memory of variables z_i^t, Hessian approximations B_i^t, and gradients ∇f_i^t
   ⇒ Functions indexed by i. Time indexed by t. Select function f_{i_t} at time t

[Figure: memory tables z_1^t, ..., z_N^t; B_1^t, ..., B_N^t; ∇f_1^t, ..., ∇f_N^t feeding the computation of w_{t+1}]

◮ All gradients, matrices, and variables are used to update w_{t+1}
Incremental BFGS method

◮ Keep memory of variables z_i^t, Hessian approximations B_i^t, and gradients ∇f_i^t
   ⇒ Functions indexed by i. Time indexed by t. Select function f_{i_t} at time t

[Figure: the new iterate w_{t+1} is used to evaluate ∇f_{i_t}(w_{t+1}) for the selected index i_t]

◮ The updated variable w_{t+1} is used to update the gradient ∇f_{i_t}^{t+1} = ∇f_{i_t}(w_{t+1})
Incremental BFGS method

◮ Keep memory of variables z_i^t, Hessian approximations B_i^t, and gradients ∇f_i^t
   ⇒ Functions indexed by i. Time indexed by t. Select function f_{i_t} at time t

[Figure: w_{t+1} and ∇f_{i_t}(w_{t+1}) are used to compute the updated matrix B_{i_t}^{t+1}]

◮ Update B_{i_t}^t to satisfy the secant condition for function f_{i_t}, with variable variation w_{t+1} − z_{i_t}^t and gradient variation ∇f_{i_t}^{t+1} − ∇f_{i_t}^t (more later)
Incremental BFGS method

◮ Keep memory of variables z_i^t, Hessian approximations B_i^t, and gradients ∇f_i^t
   ⇒ Functions indexed by i. Time indexed by t. Select function f_{i_t} at time t

[Figure: the entries for index i_t in the memory tables are overwritten, producing the updated memories z_i^{t+1}, B_i^{t+1}, and ∇f_i^{t+1}]

◮ Update the variable, Hessian approximation, and gradient memory for function f_{i_t}
Update of Hessian approximation matrices

◮ Variable variation at time t for function f_i = f_{i_t} ⇒ v_i^t := z_i^{t+1} − z_i^t
◮ Gradient variation at time t for function f_i = f_{i_t} ⇒ r_i^t := ∇f_i^{t+1} − ∇f_i^t
◮ Update B_i^t = B_{i_t}^t to satisfy the secant condition for the variations v_i^t and r_i^t

      B_i^{t+1} = B_i^t + (r_i^t r_i^{tT})/(v_i^{tT} r_i^t) − (B_i^t v_i^t v_i^{tT} B_i^t)/(v_i^{tT} B_i^t v_i^t)

◮ We want B_i^t to approximate the Hessian of the function f_i = f_{i_t}
A naive (in hindsight) incremental BFGS method

◮ The key is in the update of w_t. Use memory in stochastic quantities

      w_{t+1} = w_t − [ (1/N) Σ_{i=1}^N B_i^t ]^{-1} [ (1/N) Σ_{i=1}^N ∇f_i^t ]

◮ It doesn't work ⇒ better than incremental gradient but not superlinear
◮ Optimization updates are solutions of function approximations
◮ In this particular update we are minimizing the quadratic form

      f(w) ≈ (1/N) Σ_{i=1}^N [ f_i(z_i^t) + ∇f_i(z_i^t)^T (w − w_t) + (1/2)(w − w_t)^T B_i^t (w − w_t) ]

◮ Gradients are evaluated at z_i^t. The secant condition is verified at z_i^t
◮ But the quadratic form is centered at w_t. Not a reasonable Taylor series
A proper Taylor series expansion

◮ Each individual function f_i is being approximated by the quadratic

      f_i(w) ≈ f_i(z_i^t) + ∇f_i(z_i^t)^T (w − w_t) + (1/2)(w − w_t)^T B_i^t (w − w_t)

◮ To have a proper expansion we have to recenter the quadratic form at z_i^t

      f_i(w) ≈ f_i(z_i^t) + ∇f_i(z_i^t)^T (w − z_i^t) + (1/2)(w − z_i^t)^T B_i^t (w − z_i^t)

◮ I.e., we approximate f(w) with the aggregate quadratic function

      f(w) ≈ (1/N) Σ_{i=1}^N [ f_i(z_i^t) + ∇f_i(z_i^t)^T (w − z_i^t) + (1/2)(w − z_i^t)^T B_i^t (w − z_i^t) ]

◮ This is now a reasonable Taylor series that we use to derive an update
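Setting the gradient of this aggregate quadratic with respect to w to zero gives the update on the next slide (a one-line derivation sketch, using the notation above):

      (1/N) Σ_{i=1}^N [ ∇f_i(z_i^t) + B_i^t (w − z_i^t) ] = 0
      ⇒ [ (1/N) Σ_{i=1}^N B_i^t ] w = (1/N) Σ_{i=1}^N B_i^t z_i^t − (1/N) Σ_{i=1}^N ∇f_i(z_i^t)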
Incremental BFGS

◮ Solving this quadratic program yields the update for the IQN method

      w_{t+1} = [ (1/N) Σ_{i=1}^N B_i^t ]^{-1} [ (1/N) Σ_{i=1}^N B_i^t z_i^t − (1/N) Σ_{i=1}^N ∇f_i(z_i^t) ]

◮ Looks difficult to implement but it is more similar to BFGS than apparent
◮ As in BFGS, it can be implemented with O(p²) operations
   ⇒ Write as a rank-2 update, use the matrix inversion lemma
   ⇒ Independently of N. True incremental method
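A minimal sketch of the IQN iteration on the illustrative problem, reusing grad_f_i from above. It solves the p × p system directly (O(p³) per step) rather than using the rank-2 / matrix-inversion-lemma implementation the slide refers to; the identity initialization of the B_i, the choice z_i = w_0, and the cyclic index rule are assumptions of the example:

```python
def iqn(w0, num_iters=5000):
    """Sketch of the IQN update with per-function memories z_i, B_i, grad f_i.
    Keeps N matrices of size p x p in memory (O(N p^2) storage)."""
    w = w0.copy()
    z = np.tile(w, (N, 1))                                    # z_i^t
    B = np.array([np.eye(p) for _ in range(N)])               # B_i^t
    grads = np.array([grad_f_i(w, i) for i in range(N)])      # grad f_i(z_i^t)
    B_sum = B.sum(axis=0)                                     # sum_i B_i^t
    Bz_sum = np.einsum('ipq,iq->p', B, z)                     # sum_i B_i^t z_i^t
    g_sum = grads.sum(axis=0)                                 # sum_i grad f_i(z_i^t)
    for t in range(num_iters):
        i_t = t % N                                           # cyclic index selection
        # w^{t+1} = (sum_i B_i)^{-1} (sum_i B_i z_i - sum_i grad f_i(z_i)); 1/N factors cancel
        w = np.linalg.solve(B_sum, Bz_sum - g_sum)
        new_grad = grad_f_i(w, i_t)
        v, r = w - z[i_t], new_grad - grads[i_t]              # variations for index i_t
        Bv = B[i_t] @ v                                        # (a safeguard on v^T r > 0 is advisable)
        B_new = B[i_t] + np.outer(r, r) / (v @ r) - np.outer(Bv, Bv) / (v @ Bv)
        # refresh the running sums for the updated index only
        B_sum += B_new - B[i_t]
        Bz_sum += B_new @ w - B[i_t] @ z[i_t]
        g_sum += new_grad - grads[i_t]
        B[i_t], z[i_t], grads[i_t] = B_new, w, new_grad       # overwrite the memories
    return w
```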