An Accelerated Variance Reducing Stochastic Method with Douglas-Rachford Splitting
Jingchang Liu, University of Science and Technology of China
November 12, 2018
Table of Contents
• Background
• Moreau Envelope and Douglas-Rachford (DR) Splitting
• Our methods
• Theories
• Experiments
• Conclusions
• Q & A
Background
Problem Formulation
• Regularized ERM: $\min_{x \in \mathbb{R}^d} f(x) + h(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) + h(x)$.
• $f_i : \mathbb{R}^d \to \mathbb{R}$: empirical loss of the $i$-th sample, convex.
• $h$: regularization term, convex but possibly non-smooth.
• Examples: LASSO, sparse SVM, $\ell_1$-$\ell_2$-Logistic Regression.

Definitions
• Proximal operator: $\mathrm{prox}_{\gamma f}(x) = \mathrm{argmin}_{y \in \mathbb{R}^d} \big\{ f(y) + \frac{1}{2\gamma}\|y - x\|^2 \big\}$.
• Gradient mapping: $\nabla f^\gamma(x) = \frac{1}{\gamma}\big(x - \mathrm{prox}_{\gamma f}(x)\big)$.
• Subdifferential: $\partial f(x) = \{ g \mid g^T(y - x) \le f(y) - f(x), \ \forall y \in \mathrm{dom}\, f \}$.
• $\mu$-strongly convex: $f(y) \ge f(x) + \langle g, y - x\rangle + \frac{\mu}{2}\|y - x\|^2$ for all $g \in \partial f(x)$.
• $L$-smooth: $f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2$.
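To make the proximal operator and gradient mapping concrete, here is a minimal sketch, assuming an $\ell_1$ regularizer $h(x) = \|x\|_1$, whose proximal operator is componentwise soft-thresholding:

```python
import numpy as np

def prox_l1(x, gamma):
    # Proximal operator of gamma * ||.||_1: componentwise soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def gradient_mapping(prox, x, gamma):
    # Gradient mapping from the definition above: (1/gamma) * (x - prox_{gamma f}(x)).
    return (x - prox(x, gamma)) / gamma

x = np.array([1.5, -0.2, 0.7])
print(prox_l1(x, 0.5))                    # [ 1.  -0.   0.2]
print(gradient_mapping(prox_l1, x, 0.5))  # [ 1.  -0.4  1. ], each entry lies in a subgradient of |.|
```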
Related Works
Existing algorithms
All take the form $\mathrm{prox}_{\gamma h}(x - \gamma v)$, where $v$ can be obtained from:
- GD: $v = \nabla f(x)$; more computation is needed in each iteration.
- SGD: $v = \nabla f_i(x)$; the small stepsize leads to slow convergence.
- Variance reduction (VR): $v = \nabla f_i(x) - \nabla f_i(\bar{x}) + \nabla f(\bar{x})$, as in SVRG, SAGA, SDCA.

Accelerated techniques
• Ill-conditioning: the condition number $L/\mu$ is large.
• Methods: Acc-SDCA, Catalyst, MiG, Point-SAGA.
• Drawback: more parameters need to be tuned.
Convergence Rate
• VR stochastic methods: $O\big((n + L/\mu)\log(1/\epsilon)\big)$.
• Acc-SDCA, MiG, Point-SAGA: $O\big((n + \sqrt{nL/\mu})\log(1/\epsilon)\big)$.
• When $L/\mu \gg n$, acceleration makes the convergence much faster.

Aim
Design a simpler accelerated VR stochastic method that achieves the accelerated rate $O\big((n + \sqrt{nL/\mu})\log(1/\epsilon)\big)$.
Moreau Envelope and Douglas-Rachford (DR) Splitting
Moreau Envelope
Formulation
$f^\gamma(x) = \inf_y \big\{ f(y) + \frac{1}{2\gamma}\|x - y\|^2 \big\}$.

Properties
• $x^*$ minimizes $f(x)$ iff $x^*$ minimizes $f^\gamma(x)$.
• $f^\gamma$ is continuously differentiable even when $f$ is non-differentiable, with $\nabla f^\gamma(x) = \big(x - \mathrm{prox}_{\gamma f}(x)\big)/\gamma$. Moreover, $f^\gamma$ is $1/\gamma$-smooth.
• If $f$ is $\mu$-strongly convex, then $f^\gamma$ is $\mu/(\mu\gamma + 1)$-strongly convex.
• The condition number of $f^\gamma$ is $(\mu\gamma + 1)/(\mu\gamma)$, which may be better (smaller) than $L/\mu$.

Proximal Point Algorithm (PPA)
$x^{k+1} = \mathrm{prox}_{\gamma f}(x^k) = x^k - \gamma \nabla f^\gamma(x^k)$.
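A minimal numerical sketch of these properties, assuming the non-differentiable $f(x) = |x|$ (whose Moreau envelope is the Huber function):

```python
import numpy as np

def prox_abs(x, gamma):
    # prox of gamma * |.| in one dimension: soft-thresholding.
    return np.sign(x) * max(abs(x) - gamma, 0.0)

def moreau_grad(x, gamma):
    # nabla f^gamma(x) = (x - prox_{gamma f}(x)) / gamma; for f = |.| this equals
    # clip(x / gamma, -1, 1), i.e. the derivative of the Huber function.
    return (x - prox_abs(x, gamma)) / gamma

# Proximal point algorithm: x_{k+1} = prox_{gamma f}(x_k) = x_k - gamma * nabla f^gamma(x_k).
x, gamma = 3.0, 0.5
for _ in range(10):
    x = x - gamma * moreau_grad(x, gamma)
print(x)  # 0.0: PPA reaches the shared minimizer of f and f^gamma
```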
Point-SAGA
Formulation
Used when $h$ is absent: $\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)$.

Iteration
$z_j^k = x^k + \gamma\big(g_j^k - \frac{1}{n}\sum_{i=1}^n g_i^k\big)$,
$x^{k+1} = \mathrm{prox}_{\gamma f_j}(z_j^k)$,
$g_j^{k+1} = (z_j^k - x^{k+1})/\gamma$.

Equivalence
$x^{k+1} = x^k - \gamma\big(g_j^{k+1} - g_j^k + \frac{1}{n}\sum_{i=1}^n g_i^k\big)$,
where $g_j^{k+1}$ is the gradient mapping of $f_j$ at $z_j^k$.
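A minimal sketch of the iteration above, assuming least-squares losses $f_i(x) = \frac{1}{2}(a_i^T x - b_i)^2$ (an assumption made here so that $\mathrm{prox}_{\gamma f_j}$ has a closed form):

```python
import numpy as np

def prox_linear_sq(z, a, b, gamma):
    # Closed-form prox of gamma * f_j for f_j(x) = 0.5 * (a^T x - b)^2.
    return z - gamma * a * (a @ z - b) / (1.0 + gamma * np.dot(a, a))

def point_saga(A, b, gamma, iters, seed=0):
    n, d = A.shape
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    g = np.zeros((n, d))              # table of gradient mappings g_i^k
    g_bar = g.mean(axis=0)            # (1/n) * sum_i g_i^k, maintained incrementally
    for _ in range(iters):
        j = rng.integers(n)
        z = x + gamma * (g[j] - g_bar)              # z_j^k
        x = prox_linear_sq(z, A[j], b[j], gamma)    # x^{k+1} = prox_{gamma f_j}(z_j^k)
        g_new = (z - x) / gamma                     # g_j^{k+1}: gradient mapping at z_j^k
        g_bar += (g_new - g[j]) / n                 # update the running average
        g[j] = g_new
    return x

# Example: x approaches the least-squares solution (here, the all-ones vector).
A = np.random.default_rng(1).standard_normal((50, 5))
b = A @ np.ones(5)
print(point_saga(A, b, gamma=0.1, iters=2000))
```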
Point-SAGA: Convergence Rate
• Strongly convex and smooth: $O\big((n + \sqrt{nL/\mu})\log(1/\epsilon)\big)$.
• Strongly convex and non-smooth: $O(1/\epsilon)$.
Douglas-Rachford (DR) Splitting
Formulation
$\min_{x \in \mathbb{R}^d} f(x) + h(x)$.

Iteration
$y^{k+1} = -x^k + y^k + \mathrm{prox}_{\gamma f}(2x^k - y^k)$,
$x^{k+1} = \mathrm{prox}_{\gamma h}(y^{k+1})$.

Convergence
• $F(y) = y + \mathrm{prox}_{\gamma h}\big(2\,\mathrm{prox}_{\gamma f}(y) - y\big) - \mathrm{prox}_{\gamma f}(y)$.
• $y$ is a fixed point of $F$ if and only if $x = \mathrm{prox}_{\gamma f}(y)$ satisfies $0 \in \partial f(x) + \partial h(x)$:
$y = F(y) \iff 0 \in \partial f\big(\mathrm{prox}_{\gamma f}(y)\big) + \partial h\big(\mathrm{prox}_{\gamma f}(y)\big)$.
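A minimal sketch of the DR iteration above, assuming the LASSO instance $f(x) = \frac{1}{2}\|Ax - b\|^2$ (its prox is a small linear solve) and $h(x) = \lambda\|x\|_1$ (prox = soft-thresholding):

```python
import numpy as np

def soft_threshold(v, t):
    # prox of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dr_lasso(A, b, lam, gamma=1.0, iters=300):
    n, d = A.shape
    M = np.eye(d) + gamma * (A.T @ A)   # prox_{gamma f}(v) solves (I + gamma A^T A) y = v + gamma A^T b
    Atb = A.T @ b
    y = np.zeros(d)
    x = soft_threshold(y, gamma * lam)  # x^0 = prox_{gamma h}(y^0)
    for _ in range(iters):
        y = y - x + np.linalg.solve(M, 2 * x - y + gamma * Atb)   # y^{k+1}
        x = soft_threshold(y, gamma * lam)                        # x^{k+1} = prox_{gamma h}(y^{k+1})
    return x

# Example: a small random LASSO instance with a 3-sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
x_true = np.zeros(10); x_true[:3] = 1.0
b = A @ x_true
print(dr_lasso(A, b, lam=0.1))
```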
Our methods
Algorithm
(Prox2-SAGA pseudocode, shown as a figure on the slide.)
Iterations
Main iterations
$y^{k+1} = x^k - \gamma\big(g_j^{k+1} - g_j^k + \frac{1}{n}\sum_{i=1}^n g_i^k\big)$,
$x^{k+1} = \mathrm{prox}_{\gamma h}(y^{k+1})$,
where
$g_j^{k+1} = \frac{1}{\gamma}\big[(z_j^k + x^k - y^k) - \mathrm{prox}_{\gamma f_j}(z_j^k + x^k - y^k)\big]$
is the gradient mapping of $f_j$ at $z_j^k + x^k - y^k$ (see the code sketch below).

Number of parameters
Prox2-SAGA: 1, Point-SAGA: 1, Katyusha: 3, MiG: 2, Acc-SDCA: 2, Catalyst: several.
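A minimal sketch of the main iteration above, assuming least-squares losses $f_i(x) = \frac{1}{2}(a_i^T x - b_i)^2$ (so $\mathrm{prox}_{\gamma f_j}$ is closed form) and $h = \lambda\|\cdot\|_1$; this is an illustrative reading of the displayed update, not the authors' implementation:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox2_saga(A, b, lam, gamma, iters, seed=0):
    # f_i(x) = 0.5 * (a_i^T x - b_i)^2 (assumed, closed-form prox),
    # h(x)   = lam * ||x||_1            (prox = soft-thresholding).
    n, d = A.shape
    rng = np.random.default_rng(seed)
    y = np.zeros(d)
    x = soft_threshold(y, gamma * lam)         # x^0 = prox_{gamma h}(y^0)
    g = np.zeros((n, d))                       # table of gradient mappings g_i^k
    g_bar = g.mean(axis=0)
    for _ in range(iters):
        j = rng.integers(n)
        a, bj = A[j], b[j]
        u = 2 * x - y + gamma * (g[j] - g_bar)      # = z_j^k + x^k - y^k
        p = u - gamma * a * (a @ u - bj) / (1.0 + gamma * np.dot(a, a))  # prox_{gamma f_j}(u)
        g_new = (u - p) / gamma                     # g_j^{k+1}
        y = x - gamma * (g_new - g[j] + g_bar)      # y^{k+1}
        x = soft_threshold(y, gamma * lam)          # x^{k+1} = prox_{gamma h}(y^{k+1})
        g_bar += (g_new - g[j]) / n                 # maintain (1/n) * sum_i g_i
        g[j] = g_new
    return x
# Sanity checks: with lam = 0 this reduces to Point-SAGA; with n = 1 it reduces to the DR iteration.
```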
Connections to other algorithms
Point-SAGA
When $h = 0$, we have $x^k = y^k$ in Prox2-SAGA, and the iteration reduces to
$z_j^k = x^k + \gamma\big(g_j^k - \frac{1}{n}\sum_{i=1}^n g_i^k\big)$,
$x^{k+1} = \mathrm{prox}_{\gamma f_j}(z_j^k)$,
$g_j^{k+1} = \frac{1}{\gamma}(z_j^k - x^{k+1})$.

DR splitting
When $n = 1$, since $g_j^k = \frac{1}{n}\sum_{i=1}^n g_i^k$ in Prox2-SAGA, the iteration reduces to
$y^{k+1} = -x^k + y^k + \mathrm{prox}_{\gamma f}(2x^k - y^k)$,
$x^{k+1} = \mathrm{prox}_{\gamma h}(y^{k+1})$.
Theories
Effectiveness
Proposition
Suppose that $(y^\infty, \{g_i^\infty\}_{i=1,\dots,n})$ is a fixed point of the Prox2-SAGA iteration. Then $x^\infty = \mathrm{prox}_{\gamma h}(y^\infty)$ is a minimizer of $f + h$.

Proof.
Since $y^\infty = -x^\infty + y^\infty + \mathrm{prox}_{\gamma f_i}(z_i^\infty + x^\infty - y^\infty)$, we have $x^\infty = \mathrm{prox}_{\gamma f_i}(z_i^\infty + x^\infty - y^\infty)$, which implies
$(z_i^\infty - y^\infty)/\gamma \in \partial f_i(x^\infty), \quad i = 1, \dots, n. \quad (1)$
Meanwhile, because $x^\infty = \mathrm{prox}_{\gamma h}(y^\infty)$, we have
$(y^\infty - x^\infty)/\gamma \in \partial h(x^\infty). \quad (2)$
Observing that
$\frac{1}{n}\sum_{i=1}^n (z_i^\infty - y^\infty) + (y^\infty - x^\infty) = \frac{1}{n}\sum_{i=1}^n z_i^\infty - x^\infty = 0$,
from (1) and (2) we obtain $0 \in \partial f(x^\infty) + \partial h(x^\infty)$.
Convergence Rate
Non-strongly convex case
Suppose that each $f_i$ is convex and $L$-smooth, and $h$ is convex. Denote $\bar{g}_j^k = \frac{1}{k}\sum_{t=1}^k g_j^t$. Then for Prox2-SAGA with stepsize $\gamma \le 1/L$, at any time $k > 0$ it holds that
$\mathbb{E}\big[\|\bar{g}_j^k - g_j^*\|^2\big] \le \frac{1}{k}\Big(\sum_{i=1}^n \|g_i^0 - g_i^*\|^2 + \big\|\tfrac{1}{\gamma}(y^0 - y^*)\big\|^2\Big)$.

Strongly convex case
Suppose that each $f_i$ is $\mu$-strongly convex and $L$-smooth, and $h$ is convex. Then for Prox2-SAGA with stepsize $\gamma = \min\big\{\frac{1}{\mu n}, \frac{\sqrt{9L^2 + 3\mu L} - 3L}{2\mu L}\big\}$, for any time $k > 0$ it holds that
$\mathbb{E}\big[\|x^k - x^*\|^2\big] \le \Big(1 - \frac{\mu\gamma}{\mu\gamma + 2}\Big)^k \Big(\frac{2 - \mu\gamma}{2 - n\mu\gamma}\sum_{i=1}^n \|\gamma(g_i^0 - g_i^*)\|^2 + \|y^0 - y^*\|^2\Big)$.
Remarks
- With the stepsize
  $\gamma = \min\big\{\frac{1}{\mu n}, \frac{\sqrt{9L^2 + 3\mu L} - 3L}{2\mu L}\big\}$,
  $O\big((n + L/\mu)\log(1/\epsilon)\big)$ steps are required to achieve $\mathbb{E}\|x^k - x^*\|^2 \le \epsilon$ (this stepsize is computed in the sketch below).
- When the $f_i$ are ill-conditioned, a larger stepsize
  $\gamma = \min\big\{\frac{1}{\mu n}, \frac{6L + \sqrt{36L^2 - 6(n-2)\mu L}}{2(n-2)\mu L}\big\}$
  is possible, under which the required number of steps is $O\big((n + \sqrt{nL/\mu})\log(1/\epsilon)\big)$.
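A tiny sketch of the first stepsize rule above (the ill-conditioned variant is evaluated analogously):

```python
import math

def prox2_saga_stepsize(mu, L, n):
    # gamma = min( 1/(mu*n), (sqrt(9L^2 + 3*mu*L) - 3L) / (2*mu*L) )
    return min(1.0 / (mu * n),
               (math.sqrt(9.0 * L * L + 3.0 * mu * L) - 3.0 * L) / (2.0 * mu * L))

print(prox2_saga_stepsize(mu=1e-4, L=1.0, n=1000))  # ~0.25, i.e. roughly 1/(4L) when L/mu >> n
```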
Experiments
Experiments
Figure 2: Comparison of several algorithms with $\ell_1$-$\ell_2$-Logistic Regression.
Experiments
Figure 3: Comparison of several algorithms with $\ell_1$-$\ell_2$-Logistic Regression.
Experiments
Figure 4: Comparison of several algorithms with sparse SVMs (objective gap vs. epoch on svmguide3, rcv1, covtype, and ijcnn1; methods: Prox2-SAGA, Prox-SDCA, Prox-SAGA, Prox-SGD).
Conclusions
• Prox2-SAGA combines Point-SAGA and DR splitting.
• Point-SAGA provides the fast convergence rate of Prox2-SAGA.
• DR splitting provides the effectiveness: fixed points of the iteration minimize $f + h$, even when the regularizer $h$ is non-smooth.
Q & A