Estimate Sequences for Variance-Reduced Stochastic Composite Optimization

Andrei Kulunchakov (andrei.kulunchakov@inria.fr)
Julien Mairal (julien.mairal@inria.fr)

International Conference on Machine Learning, 2019
Poster event-4062 (Jun 12th, Pacific Ballroom 204)
Problem statement

Assumptions
We solve a stochastic composite optimization problem
\[
F(x) = f(x) + \psi(x) \quad \text{with} \quad f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad f_i(x) = \mathbb{E}_{\xi}\big[\tilde f_i(x,\xi)\big],
\]
where ψ(x) is a convex penalty and each f_i is L-smooth and µ-strongly convex.

Variance in gradient estimates
Stochastic realizations of gradients are available for each i:
\[
\tilde\nabla f_i(x) = \nabla f_i(x) + \xi_i \quad \text{with} \quad \mathbb{E}[\xi_i] = 0 \quad \text{and} \quad \mathrm{Var}[\xi_i] \le \sigma^2.
\]
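To make this setting concrete, here is a minimal Python sketch of such a composite objective with a noisy first-order oracle. The quadratic losses, the ℓ1 penalty, and all names (A, b, lam, sigma, noisy_grad_fi, prox_psi) are illustrative assumptions, not objects defined in the paper.

```python
# Minimal sketch of the problem setup: a finite sum of mu-strongly convex,
# L-smooth losses f_i plus a convex penalty psi, with a noisy gradient oracle.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
lam, mu, sigma = 0.1, 0.01, 0.05  # penalty weight, strong convexity, noise level

def grad_fi(x, i):
    """Exact gradient of f_i(x) = 0.5*(a_i^T x - b_i)^2 + 0.5*mu*||x||^2."""
    return A[i] * (A[i] @ x - b[i]) + mu * x

def noisy_grad_fi(x, i):
    """Stochastic oracle: grad f_i(x) + xi_i with E[xi_i] = 0 and E||xi_i||^2 <= sigma^2."""
    return grad_fi(x, i) + sigma * rng.standard_normal(x.shape) / np.sqrt(x.size)

def prox_psi(x, step):
    """Proximal operator of psi(x) = lam * ||x||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)
```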
Main contribution (I)

Optimal incremental algorithm robust to noise
Optimal incremental algorithm with a complexity
\[
O\!\left(\left(n + \sqrt{\frac{nL}{\mu}}\right)\log\left(\frac{F(x_0) - F^\star}{\varepsilon}\right)\right) + O\!\left(\frac{\sigma^2}{\mu\varepsilon}\right),
\]
based on the SVRG gradient estimator with random sampling.

Algorithm
Briefly, the algorithm is an incremental hybrid of the heavy-ball method with a randomly updated SVRG anchor point and two auxiliary sequences controlling the extrapolation.
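As an illustration of the building block named above, here is a sketch of a plain proximal SVRG loop whose anchor point is refreshed with probability 1/n at each iteration rather than at fixed epoch boundaries. It reuses noisy_grad_fi and prox_psi from the setup sketch; the paper's accelerated method additionally maintains the two auxiliary heavy-ball/extrapolation sequences, which are omitted here.

```python
def rand_svrg(x0, n_iters, step):
    """Sketch of prox-SVRG with a randomly updated anchor (non-accelerated variant)."""
    x, anchor = x0.copy(), x0.copy()
    # (Noisy) full gradient at the anchor point.
    g_anchor = np.mean([noisy_grad_fi(anchor, j) for j in range(n)], axis=0)
    for _ in range(n_iters):
        i = rng.integers(n)
        # Variance-reduced gradient estimate at the current iterate.
        g = noisy_grad_fi(x, i) - noisy_grad_fi(anchor, i) + g_anchor
        x = prox_psi(x - step * g, step)
        if rng.random() < 1.0 / n:  # refresh the anchor with probability 1/n
            anchor = x.copy()
            g_anchor = np.mean([noisy_grad_fi(anchor, j) for j in range(n)], axis=0)
    return x
```

Constant step sizes such as 1/(12L) or 1/(3L), as used for the rand-SVRG curves in the experiments slide, are natural choices for `step` in this sketch.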
Main contribution (II)

Novelty
• When σ² = 0, we recover the same complexity as Katyusha [Allen-Zhu, 2017].
• Novelty: an accelerated incremental algorithm robust to σ² > 0 with the optimal term σ²/(µε).

Other contributions
• Generic proofs for incremental methods (SVRG, SAGA, MISO, SDCA) showing their robustness to noise:
\[
O\!\left(\left(n + \frac{L}{\mu}\right)\log\left(\frac{F(x_0) - F^\star}{\varepsilon}\right)\right) + O\!\left(\frac{\sigma^2}{\mu\varepsilon}\right).
\]
• When µ = 0, we recover optimal rates for a fixed horizon and known σ².
• Support for non-uniform sampling.
Side contributions

Adaptivity to the strong convexity parameter µ
When σ = 0, we show adaptivity to µ for all above-mentioned non-accelerated methods. This property is new for SVRG.

Accelerated SGD
A version of robust accelerated SGD with complexity similar to [Ghadimi and Lan, 2012, 2013]:
\[
O\!\left(\sqrt{\frac{L}{\mu}}\,\log\left(\frac{F(x_0) - F^\star}{\varepsilon}\right)\right) + O\!\left(\frac{\sigma^2 + \sigma_n^2}{\mu\varepsilon}\right),
\]
where σ_n² is due to sampling the data points.
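For comparison, a generic accelerated proximal SGD iteration (extrapolation followed by a stochastic proximal-gradient step) looks as follows. This is a standard sketch under the assumptions of the setup above, not the paper's exact robust variant, which additionally controls the σ² terms (e.g., via decreasing step sizes or mini-batches).

```python
def acc_sgd(x0, n_iters, step, mu):
    """Sketch of accelerated proximal SGD for a mu-strongly convex objective."""
    x_prev, x = x0.copy(), x0.copy()
    beta = (1 - np.sqrt(mu * step)) / (1 + np.sqrt(mu * step))  # momentum weight
    for _ in range(n_iters):
        y = x + beta * (x - x_prev)  # extrapolation (look-ahead) point
        i = rng.integers(n)
        x_prev, x = x, prox_psi(y - step * noisy_grad_fi(y, i), step)
    return x
```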
Experiments

Three datasets in the experiments:
— Pascal Large Scale Learning Challenge (n = 25·10⁴)
— Light gene expression data for breast cancer (n = 295)
— CIFAR-10 (images represented by features from a network) with n = 5·10⁴

Examples with zero noise (σ = 0) and stochastic case (σ > 0).

[Figure: log(F/F⋆ − 1) versus effective passes over the data on CIFAR-10 and the Pascal Challenge, comparing rand-SVRG (step sizes 1/12L, 1/3L), acc-SVRG (1/3L), SGD (1/L), and decreasing-step-size variants (rand-SVRG-d, acc-SVRG-d, SGD-d, acc-SGD-d, acc-mb-SGD-d).]