SDCA: Stochastic Dual Coordinate Ascent
Jingchang Liu
June 29, 2017
University of Science and Technology of China
Table of Contents
• Lagrangian Duality
• SDCA
• Convergence Rate
• Experiments
• Asynchronous SDCA
• Q & A
Lagrangian Duality
Dual Problem

Primal Problem
\[
\begin{aligned}
\min \quad & f_0(x) \\
\text{s.t.} \quad & f_i(x) \le 0, \quad i = 1, 2, \cdots, m \\
& h_i(x) = 0, \quad i = 1, 2, \cdots, p
\end{aligned}
\]

Lagrangian Function
\[
L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x), \quad \lambda_i \ge 0
\]

Dual Function
\[
g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu)
\]
g(\lambda, \nu) is a concave function.
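As a quick sanity check of these definitions, here is a one-variable worked example (added for illustration; it is not on the original slide):
\[
\min_x \; x^2 \quad \text{s.t.} \quad 1 - x \le 0
\]
\[
L(x, \lambda) = x^2 + \lambda (1 - x), \qquad
g(\lambda) = \inf_x L(x, \lambda) = \lambda - \frac{\lambda^2}{4}.
\]
Maximizing the concave g over \lambda \ge 0 gives \lambda^* = 2 and g(\lambda^*) = 1, which matches the primal optimum at x^* = 1 (strong duality holds here).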
SDCA
Reference
Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization, Shai Shalev-Shwartz & Tong Zhang, JMLR 2013
Optimization Objective

Formulation
\[
\min_{w \in \mathbb{R}^d} P(w), \qquad
P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i\!\left(w^T x_i\right) + \frac{\lambda}{2} \|w\|^2
\]

Parameters
• x_1, x_2, \cdots, x_n \in \mathbb{R}^d; \phi_1, \phi_2, \cdots, \phi_n: scalar convex functions.
• SGD: O(1/n)

Examples
• SVM: \phi_i(w^T x_i) = \max\{0, 1 - y_i w^T x_i\}
• Logistic Regression: \phi_i(w^T x_i) = \log\left(1 + \exp\left(-y_i w^T x_i\right)\right)
• Ridge Regression: \phi_i(w^T x_i) = \left(w^T x_i - y_i\right)^2
Dual Problem

Dual Problem
\[
\max_{\alpha} D(\alpha), \qquad
D(\alpha) = \frac{1}{n} \sum_{i=1}^{n} -\phi_i^*(-\alpha_i)
- \frac{\lambda}{2} \left\| \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i \right\|^2
\]

Conjugate function: \phi_i^*(u) = \max_z \left( z u - \phi_i(z) \right)

Derivation
\[
P(w) = \frac{1}{n} \sum_{i=1}^{n} \phi_i\!\left(w^T x_i\right) + \frac{\lambda}{2} \|w\|^2
\]
equals
\[
\begin{aligned}
P(y, z) = \; & \frac{1}{n} \sum_{i=1}^{n} \phi_i(z_i) + \frac{\lambda}{2} \|y\|^2 \\
\text{s.t.} \; & y^T x_i = z_i, \quad i = 1, 2, \cdots, n
\end{aligned}
\]
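For concreteness, here is the conjugate for the SVM hinge loss \phi_i(z) = \max\{0, 1 - y_i z\} with y_i \in \{\pm 1\} (a standard computation, added here as an example):
\[
\phi_i^*(u) =
\begin{cases}
y_i u, & y_i u \in [-1, 0] \\
+\infty, & \text{otherwise}
\end{cases}
\qquad \Longrightarrow \qquad
-\phi_i^*(-\alpha_i) = y_i \alpha_i \;\; \text{for } y_i \alpha_i \in [0, 1],
\]
so in the SVM case the dual reduces to the familiar box-constrained problem over \alpha_i y_i \in [0, 1].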
Derivation
\[
L(y, z, \alpha) = P(y, z) + \frac{1}{n} \sum_{i=1}^{n} \alpha_i \left( z_i - y^T x_i \right)
\]
\[
\begin{aligned}
D(\alpha) &= \inf_{y, z} L(y, z, \alpha) \\
&= \frac{1}{n} \sum_{i=1}^{n} \inf_{z_i} \left\{ \phi_i(z_i) + \alpha_i z_i \right\}
+ \inf_{y} \left\{ \frac{\lambda}{2} \|y\|^2 - \frac{1}{n} \sum_{i=1}^{n} \alpha_i y^T x_i \right\} \\
&= \frac{1}{n} \sum_{i=1}^{n} -\phi_i^*(-\alpha_i)
- \frac{\lambda}{2} \left\| \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i \right\|^2
\end{aligned}
\]
(The multiplier enters as \alpha_i (z_i - y^T x_i) so that the infimum over y is attained at the w(\alpha) below; the opposite sign convention would simply flip the sign of \alpha.)

Relationship
\[
w(\alpha) = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i
\]
Assumptions

L-Lipschitz continuous
\[
|\phi_i(a) - \phi_i(b)| \le L |a - b|
\]

(1/\gamma)-smooth
A function \phi_i: \mathbb{R} \to \mathbb{R} is (1/\gamma)-smooth if it is differentiable and its derivative is (1/\gamma)-Lipschitz.

Remark
If \phi_i is (1/\gamma)-smooth, then \phi_i^* is \gamma-strongly convex.
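A minimal example of this smoothness/strong-convexity duality (my addition, not on the original slide): the squared loss is 1-smooth and its conjugate is 1-strongly convex,
\[
\phi(a) = \tfrac{1}{2} a^2 \;\; (\gamma = 1)
\qquad \Longrightarrow \qquad
\phi^*(u) = \max_z \left( z u - \tfrac{1}{2} z^2 \right) = \tfrac{1}{2} u^2.
\]
Similarly, the smoothed hinge loss used in the experiments below is (1/\gamma)-smooth by construction.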
Algorithms

[Figure 1: Procedure SDCA]
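Since the procedure itself appears only as a figure, here is a minimal Python sketch of SDCA for the hinge loss (SVM), using the closed-form coordinate update from the paper's SVM section. The function name and driver loop are my own illustration, not the authors' code; the paper samples i with replacement, while a per-epoch permutation is a common practical variant.

import numpy as np

def sdca_hinge(X, y, lam, epochs=10, seed=0):
    # Minimal SDCA sketch for the hinge loss, assuming y[i] in {-1, +1}.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)      # dual variables, alpha^(0) = 0
    w = np.zeros(d)          # maintained as w = (1/(lam*n)) * sum_i alpha_i * x_i
    sq_norms = (X ** 2).sum(axis=1)
    for _ in range(epochs):
        for i in rng.permutation(n):
            if sq_norms[i] == 0.0:
                continue     # all-zero example: nothing to update
            margin_gap = 1.0 - y[i] * X[i].dot(w)
            # exact maximizer of the i-th dual coordinate,
            # keeping alpha_i * y_i inside [0, 1]
            delta = y[i] * max(0.0, min(1.0,
                        lam * n * margin_gap / sq_norms[i]
                        + alpha[i] * y[i])) - alpha[i]
            alpha[i] += delta
            w += (delta / (lam * n)) * X[i]
    return w, alpha

On output, w and alpha satisfy the relationship w = (1/(\lambda n)) \sum_i \alpha_i x_i by construction, so the duality gap P(w) - D(\alpha) can be tracked directly.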
Theorem

Th1
Consider Procedure SDCA with \alpha^{(0)} = 0. Assume that \phi_i is L-Lipschitz for all i. To obtain a duality gap of \mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \le \varepsilon, it suffices to have a total number of iterations of
\[
T \ge T_0 + n + \frac{4 L^2}{\lambda \varepsilon}
\]

Th2
Consider Procedure SDCA with \alpha^{(0)} = 0. Assume that \phi_i is (1/\gamma)-smooth for all i. To obtain a duality gap of \mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \le \varepsilon, it suffices to have a total number of iterations of
\[
T \ge \left( n + \frac{1}{\lambda \gamma} \right) \log \left( \left( n + \frac{1}{\lambda \gamma} \right) \cdot \frac{1}{\varepsilon} \right)
\]
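To make the smooth bound concrete, a hypothetical instance (values chosen purely for illustration): with n = 10^5, \lambda = 10^{-4}, \gamma = 1, and \varepsilon = 10^{-3},
\[
n + \frac{1}{\lambda \gamma} = 10^5 + 10^4 = 1.1 \times 10^5,
\qquad
T \ge 1.1 \times 10^5 \cdot \log\!\left( \frac{1.1 \times 10^5}{10^{-3}} \right) \approx 1.1 \times 10^5 \cdot 18.5 \approx 2 \times 10^6,
\]
i.e. roughly 20 passes over the data, in contrast to the 1/\varepsilon dependence of the Lipschitz case in Th1.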
Linear Convergence for Smooth Hinge Loss

[Figure 2: Experiments with the smoothed hinge-loss (γ = 1)]
Convergence for Non-smooth Hinge Loss

[Figure 3: Experiments with the hinge-loss (non-smooth)]
Effect of Smoothness Parameter

[Figure 4: Duality gap as a function of the number of rounds for different values of γ]
Comparison to SGD

[Figure 5: Comparing the primal sub-optimality of SDCA and SGD for the smoothed hinge-loss (γ = 1)]
Asynchronous SDCA
Introduction

Reference
PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent, Cho-Jui Hsieh, Hsiang-Fu Yu & Inderjit S. Dhillon, ICML 2015

Primal Problem
\[
\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{2} \|w\|^2 + \sum_{i=1}^{n} \ell_i\!\left(w^T x_i\right)
\]

Dual Problem
\[
\min_{\alpha \in \mathbb{R}^n} D(\alpha) := \frac{1}{2} \left\| \sum_{i=1}^{n} \alpha_i x_i \right\|^2 + \sum_{i=1}^{n} \ell_i^*(-\alpha_i)
\]
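By the same Lagrangian derivation as before (this remark is my addition, not on the slide), the primal and dual solutions are linked by
\[
w(\alpha) = \sum_{i=1}^{n} \alpha_i x_i,
\]
which is exactly the vector each PASSCoDe worker maintains incrementally as it updates individual coordinates \alpha_i.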
Algorithm

[Figure 6: Parallel Asynchronous Stochastic dual Co-ordinate Descent (PASSCoDe)]
Operation

PASSCoDe-Lock
• Step 1.5: lock the variables in N_i := \{ w_t \mid (x_i)_t \ne 0 \}.
• The locks are then released after step 3.
• Without the locks, a read of w may be inconsistent.

PASSCoDe-Atomic
• Step 3: for each j \in N_i, update w_j \leftarrow w_j + \Delta\alpha_i (x_i)_j atomically.
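To make the locking discipline concrete, here is a minimal Python sketch of the PASSCoDe-Lock pattern for the hinge-loss dual. It is illustrative only: the names are mine, CPython's GIL serializes much of the numeric work, and the paper's actual implementation is in C/OpenMP. Here alpha stores the y_i-scaled duals (the usual SVM convention), so w = sum_i alpha_i * y_i * x_i and 0 <= alpha_i <= C.

import threading
import numpy as np

def passcode_lock_hinge(X, y, C=1.0, epochs=5, n_threads=4, seed=0):
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    coord_locks = [threading.Lock() for _ in range(d)]  # one lock per w_j
    sq = (X ** 2).sum(axis=1)
    rng = np.random.default_rng(seed)

    def worker(indices):
        for i in indices:
            if sq[i] == 0.0:
                continue
            N_i = np.flatnonzero(X[i])        # coordinates this update touches
            for j in N_i:                     # step 1.5: lock N_i; acquiring in
                coord_locks[j].acquire()      # ascending order avoids deadlock
            try:
                g = y[i] * X[i, N_i].dot(w[N_i]) - 1.0        # dual gradient
                new = min(max(alpha[i] - g / sq[i], 0.0), C)  # projected step
                delta = new - alpha[i]
                alpha[i] = new
                w[N_i] += delta * y[i] * X[i, N_i]            # step 3
            finally:
                for j in N_i:                 # release after step 3
                    coord_locks[j].release()

    for _ in range(epochs):
        chunks = np.array_split(rng.permutation(n), n_threads)
        threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return w, alpha

PASSCoDe-Atomic would drop the per-coordinate locks and instead perform each w_j update in step 3 with an atomic add, accepting possibly inconsistent reads of w; that is the variant analyzed in the next theorem.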
Linear Convergence Rate of PASSCoDe-Atomic

Theorem
If
\[
\frac{\tau (\tau + 1)^2 e M}{\sqrt{n}} \le \frac{1}{6}
\quad \text{and} \quad
1 \ge \frac{2 L_{\max} \tau^2 M^2 e^2}{R_{\min}^2 n} \left( 1 + \frac{e \tau M}{\sqrt{n}} \right),
\]
then PASSCoDe-Atomic has a global linear convergence rate in expectation; that is,
\[
\mathbb{E}\left[ D\!\left(\alpha^{j+1}\right) \right] - D(\alpha^*) \le \eta \left( \mathbb{E}\left[ D\!\left(\alpha^{j}\right) \right] - D(\alpha^*) \right),
\]
where \alpha^* is the optimal solution and
\[
\eta = 1 - \frac{\kappa}{L_{\max} n} \left( 1 - \frac{2 L_{\max} \tau^2 M^2 e^2}{R_{\min}^2 n} \left( 1 + \frac{e \tau M}{\sqrt{n}} \right) \right).
\]
Convergence and Efficiency

[Figure 7: Convergence and efficiency for the news20, covtype, and rcv1 datasets]
Speedup

[Figure 8: Speedup for the news20, covtype, and rcv1 datasets]
Q & A