1. Convex Optimization for Data Science. Gasnikov Alexander, gasnikov.av@mipt.ru. Lecture 6. Gradient-free methods. Coordinate descent. February, 2017

2. Main books:
Spall J.C. Introduction to stochastic search and optimization: estimation, simulation and control. Wiley, 2003.
Nesterov Yu. Random gradient-free minimization of convex functions // CORE Discussion Paper 2011/1. 2011.
Nesterov Y.E. Efficiency of coordinate descent methods on huge-scale optimization problems // SIAM Journal on Optimization. 2012. V. 22. № 2. P. 341–362.
Fercoq O., Richtarik P. Accelerated, Parallel and Proximal Coordinate Descent // e-print, 2013. arXiv:1312.5799
Duchi J.C., Jordan M.I., Wainwright M.J., Wibisono A. Optimal rates for zero-order convex optimization: the power of two function evaluations // IEEE Transactions on Information Theory. 2015. V. 61. № 5. P. 2788–2806.
Wright S.J. Coordinate descent algorithms // e-print, 2015. arXiv:1502.04759
Gasnikov A.V. Searching equilibriums in large transport networks. Doctoral Thesis. MIPT, 2016. arXiv:1607.03142

3. Structure of Lecture 6
- Two-point gradient-free methods and directional derivative methods (preliminary results)
- Stochastic Mirror Descent and gradient-free methods
- The principal difference between one-point and two-point feedbacks
- Non-smooth case (double-smoothing technique)
- Randomized Similar Triangles Method
- Randomized coordinate version of the Similar Triangles Method
- Explanation of why coordinate descent methods can work better in practice than their full-gradient variants
- Nesterov's examples
- A typical Data Science problem and its consideration from the (primal/dual) randomized coordinate descent point of view

4. Two-point gradient-free methods and directional derivative methods

$f(x) \to \min_{x \in \mathbb{R}^n}$.

All the results can be generalized to the composite case (Lecture 3). We assume that $\mathbb{E}\left[f(x^N)\right] - f_* \le \varepsilon$; $N$ is the number of required iterations (oracle calls), i.e. calculations of $f$ (realizations) / directional derivatives of $f$; $R$ is the "distance" between the starting point and the nearest solution.

Number of oracle calls $N$ (up to constant factors):

| | $\mathbb{E}_\xi\left[\|\nabla_x f(x,\xi)\|_2^2\right] \le M^2$ | $f(y) \le f(x) + \langle\nabla f(x), y-x\rangle + \frac{L}{2}\|y-x\|_2^2$ | $\mathbb{E}_\xi\left[\|\nabla_x f(x,\xi) - \nabla f(x)\|_2^2\right] \le D$ |
|---|---|---|---|
| $f(x)$ convex | $\dfrac{n M^2 R^2}{\varepsilon^2}$ | $\dfrac{n L R^2}{\varepsilon}$ | $n \max\left\{\dfrac{L R^2}{\varepsilon},\ \dfrac{D R^2}{\varepsilon^2}\right\}$ |
| $f(x)$ $\mu$-strongly convex in $\|\cdot\|_2$ | $\dfrac{n M^2}{\mu\varepsilon}$ | $\dfrac{n L}{\mu}\ln\dfrac{L R^2}{\varepsilon}$ | $n \max\left\{\dfrac{L}{\mu}\ln\dfrac{L R^2}{\varepsilon},\ \dfrac{D}{\mu\varepsilon}\right\}$ |
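To make the two oracle types above concrete, here is a minimal Python sketch (not from the lecture; the quadratic test function, the direction and the step $\tau$ are illustrative assumptions) contrasting the two-point function-value oracle with the directional derivative oracle:

```python
import numpy as np

def f(x):
    # illustrative smooth convex test function: f(x) = 0.5 * ||x||_2^2
    return 0.5 * np.dot(x, x)

def grad_f(x):
    # exact gradient of the test function, used only by the directional derivative oracle
    return x

def two_point_oracle(x, e, tau=1e-4):
    # estimate of the directional derivative <grad f(x), e> from two function values
    return (f(x + tau * e) - f(x)) / tau

def directional_derivative_oracle(x, e):
    # exact directional derivative <grad f(x), e>
    return np.dot(grad_f(x), e)

x = np.array([1.0, -2.0, 0.5])
e = np.array([0.0, 1.0, 0.0])
print(two_point_oracle(x, e), directional_derivative_oracle(x, e))  # both close to -2.0
```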

5. Stochastic Mirror Descent (SMD) (Lectures 3, 4)

Consider the convex optimization problem

$f(x) \to \min_{x \in Q}$,   (1)

with a stochastic oracle that returns a stochastic subgradient $\nabla_x f(x,\xi)$ such that

$\mathbb{E}_\xi\left[\nabla_x f(x,\xi)\right] = \nabla f(x)$.   (2)

We introduce the $p$-norm ($p \in [1,2]$), with $1/p + 1/q = 1$, $q \in [2,\infty]$, and assume that

$\mathbb{E}_\xi\left[\|\nabla_x f(x,\xi)\|_q^2\right] \le M^2$.   (3)

We introduce a prox-function $d(x) \ge 0$ ($d(x^0) = 0$), which is 1-strongly convex with respect to the $p$-norm, and Bregman's divergence (Lecture 3)

$V(x,z) = d(x) - d(z) - \langle \nabla d(z), x - z \rangle$.
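A standard concrete instance of this setup (not spelled out on this slide; it assumes $Q$ is the unit simplex and $x^0 = (1/n,\dots,1/n)$) is the entropy prox-function with $p = 1$:

$$d(x) = \ln n + \sum_{i=1}^n x_i \ln x_i, \qquad V(x,z) = \sum_{i=1}^n x_i \ln\frac{x_i}{z_i},$$

so the Bregman divergence is the KL divergence, and by Pinsker's inequality $V(x,z) \ge \frac{1}{2}\|x - z\|_1^2$ on the simplex, i.e. $d$ is 1-strongly convex with respect to $\|\cdot\|_1$.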

6. The method is

$x^{k+1} = \mathrm{Mirr}_{x^k}\left(h \nabla_x f(x^k,\xi^k)\right)$,   $\mathrm{Mirr}_{x^k}(v) = \arg\min_{x \in Q}\left\{\langle v, x - x^k\rangle + V(x, x^k)\right\}$.

We put $R^2 = V(x_*, x^0)$, where $x_*$ is the solution of (1) (if $x_*$ isn't unique, we take the solution that minimizes $V(x_*, x^0)$). If the $\xi^k$ are i.i.d. and

$\bar{x}^N = \frac{1}{N}\sum_{k=0}^{N-1} x^k$,   $h = \frac{R\sqrt{2}}{M\sqrt{N}}$,

then after (all the results cited below in this lecture can also be expressed in terms of probabilities-of-large-deviations bounds, see Lecture 4)

$N = \frac{2 M^2 R^2}{\varepsilon^2}$

iterations (oracle calls)

$\mathbb{E}\left[f(\bar{x}^N)\right] - f_* \le \varepsilon$.
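A minimal sketch of this recursion in the unconstrained Euclidean setup ($Q = \mathbb{R}^n$, $d(x) = \frac{1}{2}\|x - x^0\|_2^2$, so the mirror step reduces to an ordinary gradient step); the test objective, the stochastic oracle and the plugged-in values of $R$ and $M$ are illustrative assumptions, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def stoch_grad(x):
    # illustrative stochastic gradient of f(x) = 0.5 * ||x||_2^2: exact gradient plus noise
    return x + 0.1 * rng.standard_normal(x.shape)

def smd_euclidean(x0, N, R, M):
    # Euclidean Stochastic Mirror Descent: x^{k+1} = x^k - h * stoch_grad(x^k),
    # fixed step h = R*sqrt(2)/(M*sqrt(N)), output is the average of x^0, ..., x^{N-1}
    h = R * np.sqrt(2.0) / (M * np.sqrt(N))
    x = x0.copy()
    x_avg = np.zeros_like(x0)
    for _ in range(N):
        x_avg += x / N
        x = x - h * stoch_grad(x)
    return x_avg

x0 = np.ones(10)
x_bar = smd_euclidean(x0, N=10_000, R=np.linalg.norm(x0), M=2.0)
print(0.5 * np.dot(x_bar, x_bar))  # close to the optimal value f_* = 0
```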

7. Idea (randomization!)

$\nabla_x f(x^k,\xi^k,e^k) := \frac{n}{\tau}\, f(x^k + \tau e^k, \xi^k)\, e^k$,   (one-point feedback)   (4)

$\nabla_x f(x^k,\xi^k,e^k) := \frac{n}{\tau}\left(f(x^k + \tau e^k, \xi^k) - f(x^k,\xi^k)\right) e^k$,   (two-point feedback)   (5)

$\nabla_x f(x^k,\xi^k,e^k) := n \left\langle \nabla_x f(x^k,\xi^k), e^k \right\rangle e^k$.   (directional derivative feedback)   (6)

Assume that $f(x^k,\xi^k)$ is available with (non-stochastic) small noise of level $\delta$.

How to choose the i.i.d. $e^k$? Two main approaches:
- $e^k \in RS_2^n(1)$ — $e^k$ is uniformly distributed on the unit Euclidean sphere in $\mathbb{R}^n$;
- $e^k = (0,\dots,0,1,0,\dots,0)^T$ (1 in the $i$-th position) with probability $1/n$ (coordinate descent), for (5), (6).
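A minimal Python sketch of the randomized estimates (4) and (5) with sphere randomization (a sketch only: the deterministic test function is an illustrative assumption, and the $\xi$-dependence and the value noise $\delta$ are omitted):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # illustrative deterministic objective (no xi, no noise): f(x) = 0.5 * ||x||_2^2
    return 0.5 * np.dot(x, x)

def sphere_direction(n):
    # e uniformly distributed on the unit Euclidean sphere RS_2^n(1)
    e = rng.standard_normal(n)
    return e / np.linalg.norm(e)

def one_point_estimate(x, tau=1e-2):
    # (4): (n / tau) * f(x + tau * e) * e
    e = sphere_direction(x.size)
    return (x.size / tau) * f(x + tau * e) * e

def two_point_estimate(x, tau=1e-2):
    # (5): (n / tau) * (f(x + tau * e) - f(x)) * e
    e = sphere_direction(x.size)
    return (x.size / tau) * (f(x + tau * e) - f(x)) * e

x = np.array([1.0, -1.0, 2.0])
print(two_point_estimate(x))  # a single random estimate of grad f(x)
print(np.mean([two_point_estimate(x) for _ in range(20000)], axis=0))  # averages to approx grad f(x) = x
```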

8. Note that (we can't send $\tau \to 0$ in (5) because of $\delta$ in (7)):

$\mathbb{E}_{e^k}\left[n \left\langle \nabla_x f(x^k,\xi^k), e^k\right\rangle e^k\right] = \nabla_x f(x^k,\xi^k)$,   (see (2))

$\mathbb{E}_{e^k}\left[\left\|\frac{n}{\tau}\left(f(x^k + \tau e^k,\xi^k) - f(x^k,\xi^k)\right) e^k\right\|_q^2\right] \le \frac{3}{4} n^2 L^2 \tau^2\, \mathbb{E}\left[\|e^k\|_q^2\right] + 3\, \mathbb{E}_{e^k}\left[\left\|n \left\langle\nabla_x f(x^k,\xi^k), e^k\right\rangle e^k\right\|_q^2\right] + 12\, n^2 \frac{\delta^2}{\tau^2}\, \mathbb{E}\left[\|e^k\|_q^2\right]$.   (see (3))   (7)

If $\mathbb{E}_\xi\left[f(x^k,\xi^k)^2\right] \le B^2$ then

$\mathbb{E}_{e^k}\left[\left\|\frac{n}{\tau}\, f(x^k + \tau e^k, \xi^k)\, e^k\right\|_q^2\right] \le \frac{n^2 B^2}{\tau^2}\, \mathbb{E}\left[\|e^k\|_q^2\right]$.   (see (3))   (8)

For coordinate descent randomization it is optimal to choose $p = q = 2$, and the results are the same as for $e^k \in RS_2^n(1)$. For that reason we concentrate on $e^k \in RS_2^n(1)$.
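A sketch of where the three terms in (7) come from (this decomposition is not written on the slide; it assumes $f(\cdot,\xi)$ is $L$-smooth and writes the noisy values as $\tilde f(y,\xi) = f(y,\xi) + \eta(y)$ with $|\eta(y)| \le \delta$):

$$\frac{n}{\tau}\left(\tilde f(x^k + \tau e^k,\xi^k) - \tilde f(x^k,\xi^k)\right) e^k = n\left\langle\nabla_x f(x^k,\xi^k), e^k\right\rangle e^k + \frac{n}{\tau}\, r\, e^k + \frac{n}{\tau}\, \eta\, e^k,$$

where $|r| \le \frac{L\tau^2}{2}\|e^k\|_2^2 = \frac{L\tau^2}{2}$ is the Taylor remainder and $|\eta| \le 2\delta$ collects the value noise. Applying $\|a+b+c\|_q^2 \le 3\left(\|a\|_q^2 + \|b\|_q^2 + \|c\|_q^2\right)$ and taking $\mathbb{E}_{e^k}$ gives exactly the three terms $3\,\mathbb{E}_{e^k}\left[\|n\langle\nabla_x f(x^k,\xi^k), e^k\rangle e^k\|_q^2\right]$, $\frac{3}{4} n^2 L^2 \tau^2\, \mathbb{E}\left[\|e^k\|_q^2\right]$ and $12\, n^2 \frac{\delta^2}{\tau^2}\, \mathbb{E}\left[\|e^k\|_q^2\right]$.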

9. If $e \in RS_2^n(1)$ then, due to the measure concentration phenomenon (I. Usmanova),

$\mathbb{E}\left[\|e\|_q^2\right] \le \min\{2q-1,\ 4\ln n\}\, n^{2/q - 1}$, $q \in [2,\infty]$,

$\mathbb{E}\left[\langle c, e\rangle^2\right] = \frac{\|c\|_2^2}{n}$,

$\mathbb{E}\left[\langle c, e\rangle^2 \|e\|_q^2\right] \le \frac{4\,\|c\|_2^2}{n}\, \min\{2q-1,\ 4\ln n\}\, n^{2/q - 1}$.

So the choice of $p \in [1,2]$ ($q \in [2,\infty]$) is already nontrivial! For example, for $Q = S_n(1)$ — the unit simplex in $\mathbb{R}^n$ — it is natural to choose $p = 1$ ($q = \infty$).

For function-value feedback ((4), (5)) we have a biased estimate of the gradient ((2) no longer holds). So one has to generalize the approach mentioned above:

$\mathbb{E}_{e^k}\left[\frac{n}{\tau}\, f(x^k + \tau e^k,\xi^k)\, e^k\right] = \mathbb{E}_{e^k}\left[\frac{n}{\tau}\left(f(x^k + \tau e^k,\xi^k) - f(x^k,\xi^k)\right) e^k\right]$ // because $\mathbb{E}_{e^k}\left[e^k\right] = 0$

and

$\mathbb{E}_{e^k}\left[\frac{n}{\tau}\left(f(x^k + \tau e^k,\xi^k) - f(x^k,\xi^k)\right) e^k\right] \to \mathbb{E}_{e^k}\left[n\left\langle\nabla_x f(x^k,\xi^k), e^k\right\rangle e^k\right]$ as $\tau \to 0$ (if $\delta = 0$).
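A quick numerical sanity check of the two sphere facts above (a sketch, not part of the lecture; the dimension and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, samples = 500, 10_000

# rows of E are i.i.d. uniform on the unit Euclidean sphere in R^n
E = rng.standard_normal((samples, n))
E /= np.linalg.norm(E, axis=1, keepdims=True)

c = rng.standard_normal(n)

# E[<c, e>^2] equals ||c||_2^2 / n
print(np.mean((E @ c) ** 2), np.dot(c, c) / n)

# E[||e||_inf^2] is of order ln(n) / n  (the q = infinity case of the bound)
print(np.mean(np.max(np.abs(E), axis=1) ** 2), np.log(n) / n)
```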

10. Assume that, instead of the real (unbiased) stochastic gradients $\nabla_x f(x^k,\xi^k)$ (see (2)), only biased ones $\tilde\nabla_x f(x^k,\xi^k)$ are available, which satisfy (3) and additionally

$\frac{1}{N}\sum_{k=1}^{N} \mathbb{E}\left[\sup_{x \in Q}\left\langle \mathbb{E}\left[\tilde\nabla_x f(x^k,\xi^k) - \nabla_x f(x^k,\xi^k) \,\middle|\, \xi^1,\dots,\xi^{k-1}\right],\ x^k - x\right\rangle\right] \le \tilde\delta$;

then

$\mathbb{E}\left[f(\bar{x}^N)\right] - f_* \le \varepsilon + \tilde\delta$.

If $\delta$ is small enough, then one can show (by the optimal choice of $\tau$) that for (4) the number of oracle calls is as follows (here $R = \|x^0 - x_*\|_p$):

| | (stochastic) $\mathbb{E}_\xi\left[\|\nabla_x f(x,\xi)\|_2^2\right] \le M^2$ | $f(y) \le f(x) + \langle\nabla f(x), y-x\rangle + \frac{L}{2}\|y-x\|_2^2$ |
|---|---|---|
| $f(x)$ convex | $\dfrac{n^{1+2/q} B^2 M^2 R^2}{\varepsilon^4}$ | $\dfrac{n^{1+2/q} B^2 L R^2}{\varepsilon^3}$ |
| $f(x)$ $\mu$-strongly convex in $\|\cdot\|_2$ | $\dfrac{n^{1+2/q} B^2 M^2}{\mu\varepsilon^3}$ | $\dfrac{n^{1+2/q} B^2 L}{\mu\varepsilon^2}$ |
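A sketch of why the bias level enters the bound additively (a standard mirror-descent argument, not spelled out on the slide): by convexity $f(x^k) - f_* \le \langle\nabla f(x^k), x^k - x_*\rangle$, and by the tower rule

$$\mathbb{E}\left[\left\langle \nabla f(x^k),\ x^k - x_*\right\rangle\right] = \mathbb{E}\left[\left\langle \tilde\nabla_x f(x^k,\xi^k),\ x^k - x_*\right\rangle\right] - \mathbb{E}\left[\left\langle \tilde\nabla_x f(x^k,\xi^k) - \nabla_x f(x^k,\xi^k),\ x^k - x_*\right\rangle\right].$$

Summing over $k$ and dividing by $N$, the first term is handled exactly as in the unbiased SMD analysis of slide 6 (with $M^2$ replaced by the corresponding second-moment bound on $\tilde\nabla_x f$) and contributes the $\varepsilon$ part, while the averaged second term is controlled by the bias condition above, which is why the bias level appears additively in the final bound.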
