Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure



  1. Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure
     Julien Mairal, Alberto Bietti
     Inria Grenoble (Thoth), March 21, 2017

  2. Stochastic optimization in machine learning
     Stochastic approximation: min_x E_{ζ∼D}[f(x, ζ)]
     ◮ Infinite datasets (expected risk, D: data distribution), or a "single pass" over the data
     ◮ SGD, stochastic mirror descent, FOBOS, RDA
     ◮ O(1/ε) complexity
     Incremental methods with variance reduction: min_x (1/n) Σ_{i=1}^n f_i(x)
     ◮ Finite datasets (empirical risk): f_i(x) = ℓ(y_i, x⊤ξ_i) + (µ/2)‖x‖²
     ◮ SAG, SDCA, SVRG, SAGA, MISO, etc.
     ◮ O(log(1/ε)) complexity

  3. Data perturbations in machine learning
     Perturbations of the data are useful for regularization, stable feature selection, and privacy-aware learning.
     We focus on data augmentation of a finite training set, for regularization purposes (better performance on test data), e.g.:
     ◮ Image data augmentation: add random transformations of each image in the training set (crop, scale, rotate, brightness, contrast, etc.)
     ◮ Dropout: set coordinates of feature vectors to 0 with probability δ.
     Figure: Data augmentation on an MNIST digit (left), dropout applied to text (right).
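As an illustration, a minimal sketch of the dropout perturbation described above (the feature vector, the dropout rate δ = 0.3, and the random seed are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def dropout_perturbation(x, delta, rng):
    """Return a perturbed copy of the feature vector x: each coordinate
    is set to 0 independently with probability delta."""
    mask = rng.random(x.shape) >= delta  # keep a coordinate with probability 1 - delta
    return x * mask

# Example: perturb one (hypothetical) training example with dropout rate delta = 0.3
rng = np.random.default_rng(0)
x_i = rng.standard_normal(10)
x_tilde = dropout_perturbation(x_i, delta=0.3, rng=rng)
```

Each call to the function produces a fresh perturbation of the same example, which is exactly the sampling ρ ∼ Γ used in the objective of the next slide.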

  4. Optimization objective with perturbations
     min_{x ∈ R^p} F(x) = (1/n) Σ_{i=1}^n f_i(x) + h(x),   with f_i(x) = E_{ρ∼Γ}[f̃_i(x, ρ)]
     ρ: perturbation
     f̃_i(·, ρ) is convex with L-Lipschitz gradients
     F is µ-strongly convex
     h: convex, possibly non-smooth penalty, e.g. the ℓ1 norm

  5. Can we do better than SGD?
     min_{x ∈ R^p} f(x) = (1/n) Σ_{i=1}^n E_{ρ∼Γ}[f̃_i(x, ρ)]
     SGD is a natural choice
     ◮ Sample an index i_t and a perturbation ρ_t ∼ Γ
     ◮ Update: x_t = x_{t−1} − η_t ∇f̃_{i_t}(x_{t−1}, ρ_t)
     ◮ O(σ²_tot / µt) convergence, with σ²_tot := E_{i,ρ}[‖∇f̃_i(x*, ρ)‖²]
     Key observation: the variance from perturbations only is small compared to the variance across all examples.
     Contribution: improve the convergence of SGD by exploiting the finite-sum structure using variance reduction. This yields O(σ²/µt) convergence with E_ρ[‖∇f̃_i(x*, ρ) − ∇f_i(x*)‖²] ≤ σ² ≪ σ²_tot.
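A minimal sketch of this SGD baseline, assuming a ridge-regularized squared loss with dropout perturbations and the step-size schedule η_t = 1/(µt); the loss, the data layout, and the schedule are illustrative assumptions, only the update rule follows the slide:

```python
import numpy as np

def sgd_with_perturbations(X, y, mu, delta, n_iters, rng):
    """Plain SGD on f(x) = (1/n) sum_i E_rho[ f~_i(x, rho) ]: at each step,
    sample an example i_t and a dropout perturbation rho_t, then take a
    gradient step on f~_{i_t}(., rho_t)."""
    n, p = X.shape
    x = np.zeros(p)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                      # sample index i_t
        xi = X[i] * (rng.random(p) >= delta)     # perturbed features (rho_t)
        grad = (xi @ x - y[i]) * xi + mu * x     # gradient of f~_{i_t} at x_{t-1}
        x -= grad / (mu * t)                     # step-size eta_t = 1/(mu t)
    return x
```

Its variance term involves σ²_tot, which mixes the spread across examples with the spread across perturbations; the variance-reduced method below only pays for the latter.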

  6. Background: MISO algorithm (Mairal, 2015)
     Finite-sum problem: min_x f(x) = (1/n) Σ_{i=1}^n f_i(x)
     Maintains a quadratic lower-bound model d_i^t(x) = (µ/2)‖x − z_i^t‖² + c_i^t on each f_i
     d_i^t is updated using a strong-convexity lower bound on f_i:
       f_i(x) ≥ f_i(x_{t−1}) + ⟨∇f_i(x_{t−1}), x − x_{t−1}⟩ + (µ/2)‖x − x_{t−1}‖² =: l_i^t(x)
     Two steps:
     ◮ Select i_t, update: d_i^t(x) = (1 − α) d_i^{t−1}(x) + α l_i^t(x) if i = i_t, and d_i^t(x) = d_i^{t−1}(x) otherwise
     ◮ Minimize the model: x_t = arg min_x { D_t(x) = (1/n) Σ_{i=1}^n d_i^t(x) }

  7. MISO algorithm (Mairal, 2015)
     Final algorithm: at iteration t, choose an index i_t at random and update:
       z_i^t = (1 − α) z_i^{t−1} + α (x_{t−1} − (1/µ) ∇f_i(x_{t−1})) if i = i_t, and z_i^t = z_i^{t−1} otherwise
       x_t = (1/n) Σ_{i=1}^n z_i^t
     Complexity O((n + L/µ) log(1/ε)), typical of variance reduction
     Similar to SDCA without duality (Shalev-Shwartz, 2016)
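A minimal sketch of these MISO updates, again assuming a ridge-regularized squared loss (the loss and data layout are assumptions; only the z_i and x updates follow the slide). The running average of the z_i is maintained incrementally, as made explicit on slide 9:

```python
import numpy as np

def miso(X, y, mu, alpha, n_epochs, rng):
    """MISO: keep one anchor point z_i per example; at each iteration move the
    selected z_{i_t} toward x_{t-1} - (1/mu) grad f_{i_t}(x_{t-1}), and keep
    x_t equal to the average of the z_i."""
    n, p = X.shape
    z = np.zeros((n, p))
    x = z.mean(axis=0)
    for t in range(n_epochs * n):
        i = rng.integers(n)
        grad = (X[i] @ x - y[i]) * X[i] + mu * x         # grad of f_i at x_{t-1}
        z_new = (1 - alpha) * z[i] + alpha * (x - grad / mu)
        x += (z_new - z[i]) / n                          # incremental update of the average
        z[i] = z_new
    return x
```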

  8. Stochastic MISO
     min_{x ∈ R^p} f(x) = (1/n) Σ_{i=1}^n E_{ρ∼Γ}[f̃_i(x, ρ)]
     With perturbations, we cannot compute exact strong-convexity lower bounds on f_i = E_ρ[f̃_i(·, ρ)]
     Instead, use approximate lower bounds built from stochastic gradient estimates ∇f̃_{i_t}(x_{t−1}, ρ_t)
     Allow decreasing step-sizes α_t in order to guarantee convergence, as in stochastic approximation

  9. Stochastic MISO: algorithm
     Input: step-size sequence (α_t)_{t≥1}
     for t = 1, 2, ... do
       Sample i_t uniformly at random and ρ_t ∼ Γ, and update:
         z_i^t = (1 − α_t) z_i^{t−1} + α_t (x_{t−1} − (1/µ) ∇f̃_{i_t}(x_{t−1}, ρ_t)) if i = i_t, and z_i^t = z_i^{t−1} otherwise
         x_t = (1/n) Σ_{i=1}^n z_i^t = x_{t−1} + (1/n)(z_{i_t}^t − z_{i_t}^{t−1})
     end for
     Note: reduces to MISO for σ² = 0 and α_t = α, and to SGD for n = 1.
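A minimal sketch of how these updates could look in code. The dropout perturbation and the ridge-regularized squared loss are illustrative assumptions, and the step-size schedule is passed in as a callable alphas(t) so that any of the schedules discussed later can be plugged in:

```python
import numpy as np

def stochastic_miso(X, y, mu, delta, alphas, n_iters, rng):
    """Stochastic MISO: the same anchors z_i as MISO, but each update uses a
    stochastic gradient of f~_{i_t}(., rho_t) computed on perturbed data and a
    (possibly decreasing) step-size alpha_t."""
    n, p = X.shape
    z = np.zeros((n, p))
    x = z.mean(axis=0)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                              # sample i_t
        xi = X[i] * (rng.random(p) >= delta)             # perturbed features (rho_t)
        grad = (xi @ x - y[i]) * xi + mu * x             # grad of f~_{i_t} at x_{t-1}
        a = alphas(t)
        z_new = (1 - a) * z[i] + a * (x - grad / mu)
        x += (z_new - z[i]) / n                          # x_t = x_{t-1} + (z_new - z_old)/n
        z[i] = z_new
    return x
```

With delta = 0 and a constant schedule this reduces to the MISO sketch above; with n = 1 it reduces to SGD, matching the note on the slide.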

  10. Stochastic MISO: convergence analysis
     Define the Lyapunov function (with z_i* := x* − (1/µ) ∇f_i(x*))
       C_t = (1/2)‖x_t − x*‖² + (α_t/(2n)) Σ_{i=1}^n ‖z_i^t − z_i*‖².
     Theorem (Recursion on C_t, smooth case)
     If (α_t)_{t≥1} are positive, non-increasing step-sizes with
       α_1 ≤ min{ 1/2, n / (2(2κ − 1)) },  where κ = L/µ,
     then C_t obeys the recursion
       E[C_t] ≤ (1 − α_t/n) E[C_{t−1}] + 2 (α_t/n)² σ²/µ².
     Note: a similar recursion holds for SGD, with σ²_tot in place of σ².

  11. Stochastic MISO: convergence with decreasing step-sizes
     Similar to SGD (Bottou et al., 2016).
     Theorem (Convergence of the Lyapunov function)
     Let the step-sizes (α_t)_{t≥1} be defined by
       α_t = 2n / (γ + t)  for γ ≥ 0 such that α_1 ≤ min{ 1/2, n / (2(2κ − 1)) }.
     Then, for t ≥ 0,
       E[C_t] ≤ ν / (γ + t + 1),  where ν := max{ 8σ²/µ², (γ + 1) C_0 }.
     Q: How can we get rid of the dependence on C_0?
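The induction step behind this rate is standard and not spelled out on the slide; filling it in: if E[C_{t−1}] ≤ ν/(γ + t), then with α_t = 2n/(γ + t) the recursion of the previous slide gives
  E[C_t] ≤ (1 − 2/(γ + t)) ν/(γ + t) + 8σ²/(µ²(γ + t)²) ≤ ((γ + t − 2)ν + ν)/(γ + t)² = (γ + t − 1)ν/(γ + t)² ≤ ν/(γ + t + 1),
using 8σ²/µ² ≤ ν in the middle step and (γ + t − 1)(γ + t + 1) ≤ (γ + t)² in the last; the base case holds because (γ + 1)C_0 ≤ ν.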

  12. Practical step-size strategy
     Following Bottou et al. (2016), we keep the step-size constant for a few epochs in order to quickly "forget" the initial condition C_0
     Using a constant step-size ᾱ, we converge linearly to a constant error level C̄ = 2ᾱσ²/(nµ²) (in practice: a few epochs)
     We then start decreasing the step-sizes with γ large enough so that α_1 = 2n/(γ + 1) ≈ ᾱ: no more C_0 in the convergence rate!
     Overall, the complexity for reaching E[‖x_t − x*‖²] ≤ ε is
       O((n + L/µ) log(C_0/ε̄)) + O(σ²/(µ²ε)).
     For E[f(x_t) − f(x*)] ≤ ε, the second term becomes O(Lσ²/(µ²ε)) via smoothness. Iterate averaging brings this down to O(σ²/(µε)).
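A small sketch of this two-phase schedule. The number of constant-step epochs and the choice of ᾱ as the largest admissible step-size are assumptions for illustration; only the formula α_t = 2n/(γ + t) and the bound on α_1 come from the slides:

```python
def smiso_step_sizes(n, kappa, constant_epochs, n_iters):
    """Two-phase step-size schedule: keep alpha_t = alpha_bar for a few epochs
    to forget C_0, then switch to alpha_t = 2n / (gamma + t) with gamma chosen
    so that the decreasing phase starts near alpha_bar."""
    alpha_bar = min(0.5, n / (2 * (2 * kappa - 1)))   # largest step-size allowed by the theory
    t_switch = constant_epochs * n                    # iterations spent in the constant phase
    gamma = 2 * n / alpha_bar - t_switch              # so that 2n / (gamma + t_switch) = alpha_bar
    schedule = []
    for t in range(1, n_iters + 1):
        schedule.append(alpha_bar if t <= t_switch else 2 * n / (gamma + t))
    return schedule
```

The returned list can be wrapped as a callable, e.g. alphas = lambda t: schedule[t - 1], and passed to the stochastic_miso sketch above.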

  13. Extensions
     Composite objectives (h ≠ 0, e.g. an ℓ1 penalty)
     ◮ MISO extends to this case by adding h to the lower-bound model (Lin et al., 2015)
     ◮ Different Lyapunov function (‖x_t − x*‖² is replaced by an upper bound)
     ◮ Similar to Regularized Dual Averaging when n = 1
     Non-uniform sampling (see the sketch after this list)
     ◮ The smoothness constants L_i of the f̃_i can vary a lot in heterogeneous datasets
     ◮ Sampling "difficult" examples more often can improve the dependence on L from L_max to the average of the L_i
     The same convergence results apply (same Lyapunov recursion, decreasing step-sizes, iterate averaging)
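A minimal sketch of non-uniform sampling proportional to per-example smoothness constants; the estimate L_i ∝ ‖ξ_i‖² + µ for a linear model and the use of numpy's Generator.choice are assumptions, not details from the slides:

```python
import numpy as np

def smoothness_weighted_sampler(X, mu, rng):
    """Return a function that samples example indices with probability
    proportional to an estimate of their smoothness constant L_i, so that
    "difficult" (large-L_i) examples are drawn more often."""
    L = np.sum(X ** 2, axis=1) + mu      # L_i estimate for a linear model with an l2 term
    probs = L / L.sum()
    return lambda: rng.choice(len(L), p=probs)

# Usage: draw i_t from the non-uniform distribution instead of uniformly
# rng = np.random.default_rng(0)
# sample_index = smoothness_weighted_sampler(X, mu=0.1, rng=rng)
# i_t = sample_index()
```

In a full implementation the update for the sampled example would also be rescaled by the inverse sampling probability to keep the gradient estimate unbiased.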

  14. Experiments: dropout
     The dropout rate δ controls the variance of the perturbations.
     Figure: f − f* versus epochs on a gene-expression dataset ("gene") for dropout rates δ = 0.30, 0.10, 0.01, comparing S-MISO, SGD, and N-SAGA, each with step-size parameters η = 0.1 and η = 1.0.

  15. Experiments: image data augmentation
     Random image crops and scalings, with features encoded by an unsupervised deep convolutional network. Different conditioning, controlled by µ.
     Figure: f − f* versus epochs on STL-10 (CKN encoding) for µ = 10⁻³, 10⁻⁴, 10⁻⁵, comparing S-MISO, SGD, and N-SAGA, each with step-size parameters η = 0.1 and η = 1.0.

  16. Conclusion
     Exploit the underlying finite-sum structure of stochastic optimization problems using variance reduction
     Brings the SGD variance term down to the variance induced by the perturbations only
     Useful for data augmentation (e.g. random image transformations, dropout)
     Future work: application to stable feature selection?
     C++/Eigen library with a Cython extension available: http://github.com/albietz/stochs
