Convex Optimization for Data Science
Gasnikov Alexander, gasnikov.av@mipt.ru
Lecture 6. Gradient-free methods. Coordinate descent
February, 2017
Main books:
Spall J.C. Introduction to stochastic search and optimization: estimation, simulation and control. Wiley, 2003.
Nesterov Yu. Random gradient-free minimization of convex functions // CORE Discussion Paper 2011/1. 2011.
Nesterov Yu.E. Efficiency of coordinate descent methods on huge-scale optimization problems // SIAM Journal on Optimization. 2012. V. 22. № 2. P. 341–362.
Fercoq O., Richtarik P. Accelerated, parallel and proximal coordinate descent // e-print, 2013. arXiv:1312.5799
Duchi J.C., Jordan M.I., Wainwright M.J., Wibisono A. Optimal rates for zero-order convex optimization: the power of two function evaluations // IEEE Transactions on Information Theory. 2015. V. 61. № 5. P. 2788–2806.
Wright S.J. Coordinate descent algorithms // e-print, 2015. arXiv:1502.04759
Gasnikov A.V. Searching equilibriums in large transport networks. Doctoral Thesis. MIPT, 2016. arXiv:1607.03142
Structure of Lecture 6
- Two-point gradient-free methods and directional derivative methods (preliminary results)
- Stochastic Mirror Descent and gradient-free methods
- The principal difference between one-point and two-point feedback
- Non-smooth case (double-smoothing technique)
- Randomized Similar Triangles Method
- Randomized coordinate version of the Similar Triangles Method
- Explanations of why coordinate descent methods can work better in practice than their full-gradient variants; Nesterov's examples
- A typical Data Science problem and its treatment from the (primal / dual) randomized coordinate descent point of view
Two-point gradient-free methods and directional derivative methods

$$f(x) \to \min_{x \in \mathbb{R}^n}.$$

All the results can be generalized to the composite case (Lecture 3). We assume that $\mathbb{E} f(x^N) - f_* \le \varepsilon$, where $N$ is the number of required iterations (oracle calls): calculations of $f$ (realizations) / directional derivatives of $f$; $R$ is the "distance" between the starting point and the nearest solution.

| | $\mathbb{E}_\xi\|\nabla_x f(x,\xi)\|_2^2 \le M^2$ | $\|\nabla f(y)-\nabla f(x)\|_2 \le L\|y-x\|_2$ | $\mathbb{E}_\xi\|\nabla_x f(x,\xi)-\nabla f(x)\|_2^2 \le D^2$ |
|---|---|---|---|
| $f(x)$ convex | $\frac{nM^2R^2}{\varepsilon^2}$ | $\frac{nLR^2}{\varepsilon}$ | $\max\left\{\frac{nLR^2}{\varepsilon},\ \frac{nD^2R^2}{\varepsilon^2}\right\}$ |
| $f(x)$ — $\mu$-strongly convex in $\|\cdot\|_2$ | $\frac{nM^2}{\mu\varepsilon}$ | $\frac{nL}{\mu}\ln\left(\frac{LR^2}{\varepsilon}\right)$ | $\max\left\{\frac{nL}{\mu}\ln\left(\frac{LR^2}{\varepsilon}\right),\ \frac{nD^2}{\mu\varepsilon}\right\}$ |
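For comparison (an illustration added here, not part of the original slide): with full (stochastic sub)gradients, non-accelerated methods solve the corresponding problems in $O\!\left(M^2R^2/\varepsilon^2\right)$, $O\!\left(LR^2/\varepsilon\right)$, etc. oracle calls (see the SMD slide below), so for $q = 2$ the price of the two-point / directional-derivative randomization in every cell of the table is one factor of the dimension:
$$N_{\text{randomized}} \approx n \cdot N_{\text{full gradient}}.$$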
Stochastic Mirror Descent (SMD) (Lectures 3, 4)

Consider the convex optimization problem
$$f(x) \to \min_{x\in Q}, \qquad (1)$$
with a stochastic oracle that returns a stochastic subgradient $\nabla_x f(x,\xi)$ such that
$$\mathbb{E}_\xi\left[\nabla_x f(x,\xi)\right] = \nabla f(x) \in \partial f(x). \qquad (2)$$
We introduce the norm $\|\cdot\| = \|\cdot\|_p$ ($p \in [1,2]$) with $1/p + 1/q = 1$, $q \in [2,\infty]$, and assume that
$$\mathbb{E}_\xi\|\nabla_x f(x,\xi)\|_q^2 \le M^2. \qquad (3)$$
We introduce a prox-function $d(x) \ge 0$ (with $d(x^0) = 0$), which is 1-strongly convex with respect to the $p$-norm, and the Bregman divergence (Lecture 3)
$$V(x,z) = d(x) - d(z) - \langle \nabla d(z),\, x - z\rangle.$$
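A standard concrete instance of this setup (added here as an illustration; it is the natural choice for the simplex example discussed later): for $p = 1$ and $Q = S_n(1)$ one takes the entropic prox-function
$$d(x) = \ln n + \sum_{i=1}^n x_i \ln x_i, \qquad V(x,z) = \sum_{i=1}^n x_i \ln\frac{x_i}{z_i},$$
which satisfies $d(x) \ge 0$ on the simplex, $d(x^0) = 0$ for $x^0 = (1/n,\dots,1/n)$, and is 1-strongly convex with respect to $\|\cdot\|_1$ by Pinsker's inequality.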
The method is
$$x^{k+1} = \mathrm{Mirr}_{x^k}\!\left(h\,\nabla_x f(x^k,\xi^k)\right), \qquad \mathrm{Mirr}_{x^k}(v) = \arg\min_{x\in Q}\left\{\langle v,\, x - x^k\rangle + V(x, x^k)\right\}.$$
We put $R^2 = V(x_*, x^0)$, where $x_*$ is the solution of (1) (if the solution isn't unique, we take the $x_*$ that minimizes $V(x_*, x^0)$). If the $\xi^k$ are i.i.d. and
$$\bar{x}^N = \frac{1}{N}\sum_{k=0}^{N-1} x^k, \qquad h = \frac{R}{M}\sqrt{\frac{2}{N}},$$
then after
$$N = \frac{2M^2R^2}{\varepsilon^2}$$
iterations (oracle calls) we have $\mathbb{E} f(\bar{x}^N) - f_* \le \varepsilon$ (all the results cited below in this lecture can also be expressed in terms of probability-of-large-deviations bounds, see Lecture 4).
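A minimal NumPy sketch of this scheme (my illustration, not code from the lecture), written for the Euclidean setup $d(x) = \tfrac12\|x\|_2^2$, in which $\mathrm{Mirr}_{x^k}(h\,g)$ reduces to a Euclidean projection of $x^k - h\,g$ onto $Q$; `stoch_grad` and `project_onto_Q` are user-supplied oracles.

```python
import numpy as np

def mirror_step_euclidean(x_k, g_k, h, project_onto_Q):
    """One step x^{k+1} = Mirr_{x^k}(h g^k) for d(x) = 0.5*||x||_2^2:
    Mirr reduces to the Euclidean projection of x^k - h g^k onto Q."""
    return project_onto_Q(x_k - h * g_k)

def smd(x0, stoch_grad, h, N, project_onto_Q):
    """Runs N steps of stochastic mirror descent, returns the averaged iterate."""
    x = x0.copy()
    x_avg = np.zeros_like(x0)
    for _ in range(N):
        x_avg += x / N          # average of x^0, ..., x^{N-1}
        x = mirror_step_euclidean(x, stoch_grad(x), h, project_onto_Q)
    return x_avg
```

For instance, with $Q = \mathbb{R}^n$ one passes `project_onto_Q = lambda x: x`, and the step size from the slide is `h = R / M * np.sqrt(2 / N)`.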
Idea (randomization!)
$$\nabla_x f(x^k, \xi^k, e^k) := \frac{n}{\tau}\, f(x^k + \tau e^k, \xi^k)\, e^k, \qquad \text{(one-point feedback)} \quad (4)$$
$$\nabla_x f(x^k, \xi^k, e^k) := \frac{n}{\tau}\left(f(x^k + \tau e^k, \xi^k) - f(x^k, \xi^k)\right) e^k, \qquad \text{(two-point feedback)} \quad (5)$$
$$\nabla_x f(x^k, \xi^k, e^k) := n\,\langle \nabla_x f(x^k, \xi^k),\, e^k\rangle\, e^k. \qquad \text{(directional derivative feedback)} \quad (6)$$
Assume that $f(x^k,\xi^k)$ is available with a (non-stochastic) small noise of level $\tilde\delta$.
How to choose the i.i.d. $e^k$? There are two main approaches:
- $e^k \in RS_2^n(1)$ — uniformly distributed on the unit Euclidean sphere in $\mathbb{R}^n$;
- $e^k = (0,\dots,0,1,0,\dots,0)$ (1 in position $i$) with probability $1/n$ (coordinate descent), for (5), (6).
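The three oracles in code, as a minimal NumPy sketch (my illustration; `f` and `grad_f` stand for the stochastic oracles $f(x,\xi)$ and $\nabla_x f(x,\xi)$, and `tau` is the smoothing parameter $\tau$):

```python
import numpy as np

def random_sphere_direction(n, rng):
    """e uniformly distributed on the unit Euclidean sphere RS_2^n(1)."""
    e = rng.standard_normal(n)
    return e / np.linalg.norm(e)

def one_point_grad(f, x, xi, tau, e):
    # (4): (n / tau) * f(x + tau*e, xi) * e
    return (x.size / tau) * f(x + tau * e, xi) * e

def two_point_grad(f, x, xi, tau, e):
    # (5): (n / tau) * (f(x + tau*e, xi) - f(x, xi)) * e, same xi in both calls
    return (x.size / tau) * (f(x + tau * e, xi) - f(x, xi)) * e

def directional_grad(grad_f, x, xi, e):
    # (6): n * <grad_x f(x, xi), e> * e
    return x.size * np.dot(grad_f(x, xi), e) * e
```

Any of these estimators can be plugged into the SMD sketch above in place of `stoch_grad`.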
Note that we can't let $\tau \to 0$ in (5) because of $\tilde\delta$ in (7). We have
$$\mathbb{E}_{e^k}\left[n\,\langle \nabla_x f(x^k,\xi^k),\, e^k\rangle\, e^k\right] = \nabla_x f(x^k,\xi^k) \qquad \text{(see (2))},$$
$$\mathbb{E}_{e^k}\left\|\frac{n}{\tau}\left(f(x^k+\tau e^k,\xi^k) + \tilde\delta(x^k+\tau e^k) - f(x^k,\xi^k) - \tilde\delta(x^k)\right)e^k\right\|_q^2 \le \frac{3n^2L^2\tau^2}{4}\,\mathbb{E}_{e^k}\|e^k\|_q^2 + 3n^2\,\mathbb{E}_{e^k}\!\left[\langle\nabla_x f(x^k,\xi^k),\, e^k\rangle^2\|e^k\|_q^2\right] + \frac{12\, n^2\tilde\delta^2}{\tau^2}\,\mathbb{E}_{e^k}\|e^k\|_q^2. \qquad \text{(see (3))} \quad (7)$$
If $\mathbb{E}_\xi\!\left[f(x,\xi)^2\right] \le B^2$ then
$$\mathbb{E}_{e^k,\xi^k}\left\|\frac{n}{\tau}\, f(x^k+\tau e^k,\xi^k)\, e^k\right\|_q^2 \le \frac{n^2 B^2}{\tau^2}\,\mathbb{E}_{e^k}\|e^k\|_q^2. \qquad \text{(see (3))} \quad (8)$$
For the coordinate-descent randomization it's optimal to choose $p = q = 2$, and the results are the same as for $e^k \in RS_2^n(1)$. Therefore we concentrate on $e^k \in RS_2^n(1)$.
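A quick Monte Carlo sanity check (my illustration) of the first identity above and of the factor-$n$ growth of the second moment for $q = 2$: the estimator $n\langle g, e\rangle e$ averages to $g$, while its mean squared norm is about $n\|g\|_2^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, samples = 50, 200_000
g = rng.standard_normal(n)                       # plays the role of grad_x f(x, xi)

E = rng.standard_normal((samples, n))
E /= np.linalg.norm(E, axis=1, keepdims=True)    # rows ~ RS_2^n(1)

est = n * (E @ g)[:, None] * E                   # n <g, e> e for each sample
print(np.linalg.norm(est.mean(axis=0) - g))      # small vs ||g||_2: E_e[n<g,e>e] = g
print((est ** 2).sum(axis=1).mean() / (g ** 2).sum())  # about n: second moment grows by a factor n
```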
If $e \in RS_2^n(1)$ then, due to the measure concentration phenomenon (I. Usmanova),
$$\mathbb{E}\|e\|_q^2 \le \min\{q-1,\, 4\ln n\}\, n^{\frac{2}{q}-1}, \qquad \mathbb{E}\langle c, e\rangle^2 = \frac{\|c\|_2^2}{n},$$
$$\mathbb{E}\!\left[\langle c, e\rangle^2 \|e\|_q^2\right] \le \frac{4\|c\|_2^2}{n}\,\min\{q-1,\, 4\ln n\}\, n^{\frac{2}{q}-1}.$$
So the choice of $p$ ($q$) is already nontrivial! For example, for $Q = S_n(1)$ — the unit simplex in $\mathbb{R}^n$ — it's natural to choose $p = 1$ ($q = \infty$).
For the function-values feedback ((4), (5)) we have a biased estimate of the gradient ((2) no longer holds). So one has to generalize the approach described above:
$$\mathbb{E}_{e^k,\xi^k}\left[\frac{n}{\tau}\, f(x^k+\tau e^k,\xi^k)\, e^k\right] = \mathbb{E}_{e^k,\xi^k}\left[\frac{n}{\tau}\left(f(x^k+\tau e^k,\xi^k) - f(x^k,\xi^k)\right) e^k\right] \;\longrightarrow\; \mathbb{E}_{e^k}\left[n\,\langle\nabla_x f(x^k,\xi^k),\, e^k\rangle\, e^k\right] \quad \text{as } \tau \to 0,\ \tilde\delta \equiv 0$$
(the first equality holds because $\mathbb{E}_{e^k}[e^k] = 0$).
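A numerical illustration (my addition) of the first bound in the extreme case $q = \infty$, where it reads $\mathbb{E}\|e\|_\infty^2 \le 4\ln n / n$: on the sphere each coordinate is of size $\sim\sqrt{\ln n / n}$, which is what makes the $p = 1$ ($q = \infty$) setup attractive in high dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
for n in (10, 100, 1000):
    E = rng.standard_normal((10_000, n))
    E /= np.linalg.norm(E, axis=1, keepdims=True)   # rows ~ RS_2^n(1)
    lhs = (np.abs(E).max(axis=1) ** 2).mean()       # Monte Carlo estimate of E ||e||_inf^2
    print(n, lhs, 4 * np.log(n) / n)                # lhs stays below the 4 ln(n)/n bound
```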
Assume that instead of the real (unbiased) stochastic gradients $\nabla_x f(x^k,\xi^k)$ (see (2)) only biased ones $\tilde\nabla_x f(x^k,\xi^k)$ are available, which satisfy (3) and, additionally,
$$\frac{1}{N}\sum_{k=1}^{N}\ \sup_{x^1,\dots,x^k}\ \mathbb{E}\left[\left\langle \tilde\nabla_x f(x^k,\xi^k) - \nabla_x f(x^k,\xi^k),\ x_* - x^k\right\rangle \,\middle|\, \xi^1,\dots,\xi^{k-1}\right] \le \delta,$$
then $\mathbb{E} f(\bar{x}^N) - f_* \le \varepsilon + \delta$.
If $\tilde\delta$ is small enough, then one can show (by the optimal choice of $\tau$) that for (4), with $R = \|x^0 - x_*\|_p$:

| | $\mathbb{E}_\xi\|\nabla_x f(x,\xi)\|_2^2 \le M^2$ (stochastic) | $\|\nabla f(y) - \nabla f(x)\|_2 \le L\|y - x\|_2$ |
|---|---|---|
| $f(x)$ convex | $\frac{n^{1+2/q} B^2 M^2 R^2}{\varepsilon^4}$ | $\frac{n^{1+2/q} B^2 L R^2}{\varepsilon^3}$ |
| $f(x)$ — $\mu$-strongly convex in $\|\cdot\|_2$ | $\frac{n^{1+2/q} B^2 M^2}{\mu\varepsilon^3}$ | $\frac{n^{1+2/q} B^2 L}{\mu\varepsilon^2}$ |
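Putting the pieces together: a toy end-to-end run (my illustration, with a hand-picked problem, noise level and step size, not tuned according to the tables) of Euclidean SMD driven by the two-point estimator (5) on a simple quadratic; the printed value should approach $f_* = 0$ as $N$ grows.

```python
import numpy as np

# Toy run: Euclidean SMD on Q = R^n with the two-point estimator (5).
# f(x, xi) = 0.5*||x||^2 + <xi, x>, E[xi] = 0, so f(x) = 0.5*||x||^2 and f_* = 0.
rng = np.random.default_rng(2)
n, N, tau = 20, 20_000, 1e-3
h = 5e-3                                   # small constant step, chosen by hand for this toy problem

def f(x, xi):
    return 0.5 * x @ x + xi @ x

def two_point_grad(x):
    xi = 0.1 * rng.standard_normal(n)      # the same realization xi in both function calls
    e = rng.standard_normal(n)
    e /= np.linalg.norm(e)                 # e ~ RS_2^n(1)
    return (n / tau) * (f(x + tau * e, xi) - f(x, xi)) * e

x = rng.standard_normal(n)
x_avg = np.zeros(n)
for _ in range(N):
    x_avg += x / N
    x = x - h * two_point_grad(x)
print(0.5 * x_avg @ x_avg)                 # f(x_avg): approaches f_* = 0 as N grows
```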