PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent

Cho-Jui Hsieh
Department of Computer Science, University of Texas at Austin
Joint work with H.-F. Yu and I. S. Dhillon
July 7, 2015

Outline

- L2-regularized Empirical Risk Minimization
- Dual Coordinate Descent (Hsieh et al., 2008)
- Parallel Dual Coordinate Descent (on multi-core machines)
- Theoretical Analysis
- Experimental Results

L2-regularized ERM

$$w^* = \arg\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \ell_i(w^T x_i)$$

- SVM with hinge loss: $\ell_i(z_i) = C \max(1 - z_i, 0)$
- SVM with squared hinge loss: $\ell_i(z_i) = C \max(1 - z_i, 0)^2$
- Logistic regression: $\ell_i(z_i) = C \log(1 + e^{-z_i})$

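The three losses and the primal objective can be written out directly. The following is a minimal sketch, not code from the talk: the function names are illustrative, and it assumes the labels are already folded into the examples (each x_i is the original example multiplied by its label of +1 or -1), since no label appears in the formulas above.

```python
import math

def hinge(z, C=1.0):
    # C * max(1 - z, 0)
    return C * max(1.0 - z, 0.0)

def squared_hinge(z, C=1.0):
    # C * max(1 - z, 0)^2
    return C * max(1.0 - z, 0.0) ** 2

def logistic(z, C=1.0):
    # C * log(1 + e^{-z}), rearranged so exp() never overflows
    return C * (max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def primal_objective(w, X, loss, C=1.0):
    # P(w) = 1/2 ||w||^2 + sum_i loss(w^T x_i)
    # Assumption: each row of X already includes its label.
    reg = 0.5 * sum(wj * wj for wj in w)
    risk = sum(loss(sum(wj * xj for wj, xj in zip(w, x)), C) for x in X)
    return reg + risk
```

At w = 0 every margin w^T x_i is zero, so the hinge-loss objective equals C times the number of examples.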
Primal and Dual Formulations

Primal Problem

$$w^* = \arg\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \ell_i(w^T x_i)$$

Dual Problem

$$\alpha^* = \arg\min_{\alpha \in \mathbb{R}^n} D(\alpha) := \frac{1}{2}\left\|\sum_{i=1}^{n} \alpha_i x_i\right\|^2 + \sum_{i=1}^{n} \ell_i^*(-\alpha_i)$$

$\ell_i^*(\cdot)$: the conjugate of $\ell_i(\cdot)$

Primal-Dual Relationship between $w^*$ and $\alpha^*$

$$w^* = w(\alpha^*) := \sum_{i=1}^{n} \alpha_i^* x_i$$

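To make the dual concrete, the sketch below evaluates w(α) and D(α) for the hinge loss only. It relies on the standard conjugate of the hinge loss, ℓ*_i(-α_i) = -α_i for 0 ≤ α_i ≤ C and +∞ otherwise. The function names are illustrative, and labels are again assumed to be folded into the rows of X.

```python
def w_of_alpha(alpha, X):
    # w(alpha) = sum_i alpha_i * x_i
    d = len(X[0])
    w = [0.0] * d
    for a, x in zip(alpha, X):
        for j in range(d):
            w[j] += a * x[j]
    return w

def dual_objective_hinge(alpha, X, C=1.0):
    # D(alpha) = 1/2 ||sum_i alpha_i x_i||^2 + sum_i conj_i(-alpha_i),
    # where for the hinge loss conj_i(-a) = -a on [0, C] and +inf outside.
    if any(a < 0.0 or a > C for a in alpha):
        return float("inf")
    w = w_of_alpha(alpha, X)
    return 0.5 * sum(wj * wj for wj in w) - sum(alpha)
```

For example, `dual_objective_hinge([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])` evaluates 0.5 * 0.5 - 1.0.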
Coordinate Descent on the Dual Problem

Randomly select an $i \in \{1, \dots, n\}$ and update $\alpha_i \leftarrow \alpha_i + \delta^*$, where

$$
\begin{aligned}
\delta^* &= \arg\min_{\delta} D(\alpha + \delta e_i) \\
&= \arg\min_{\delta} \frac{1}{2}\left(\delta + \frac{\left(\sum_{j=1}^{n} \alpha_j x_j\right)^T x_i}{\|x_i\|^2}\right)^2 + \frac{1}{\|x_i\|^2}\,\ell_i^*(-(\alpha_i + \delta)) \\
&= T_i\left(\left(\sum_{j=1}^{n} \alpha_j x_j\right)^T x_i,\ \alpha_i\right)
\end{aligned}
$$

This is a simple univariate problem, but constructing it costs O(nnz) time. DCD [Hsieh et al., 2008] reduces this to O(n_i):

- Maintain the primal variable $w = \sum_{j=1}^{n} \alpha_j x_j$, so that $\delta^* = T_i(w^T x_i, \alpha_i)$
- O(n_i) construction time, where n_i = nnz of x_i
- O(n_i) maintenance cost: $w \leftarrow w + \delta^* x_i$

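The step from D(α + δ e_i) to the univariate problem is a completion of the square. The derivation below fills in that step, writing w = Σ_j α_j x_j and collecting every term that does not depend on δ into constants; it is a reconstruction from the definitions above, not a formula copied from the slides.

```latex
\begin{aligned}
D(\alpha + \delta e_i)
  &= \frac{1}{2}\|w + \delta x_i\|^2 + \ell_i^*(-(\alpha_i + \delta))
     + \sum_{j \neq i} \ell_j^*(-\alpha_j) \\
  &= \frac{1}{2}\|x_i\|^2 \delta^2 + (w^T x_i)\,\delta
     + \ell_i^*(-(\alpha_i + \delta)) + \text{const} \\
  &= \|x_i\|^2 \left[ \frac{1}{2}\left(\delta + \frac{w^T x_i}{\|x_i\|^2}\right)^2
     + \frac{1}{\|x_i\|^2}\,\ell_i^*(-(\alpha_i + \delta)) \right] + \text{const}'
\end{aligned}
```

Dividing by the positive factor ‖x_i‖² does not change the minimizer, so δ* depends on the data only through the scalar w^T x_i and on α only through α_i.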
Dual Coordinate Descent

Stochastic Dual Coordinate Descent: for t = 1, 2, ...

1. Randomly pick an index i
2. Compute $w^T x_i$
3. Update $\alpha_i \leftarrow \alpha_i + \delta^*$, where $\delta^* = T_i(w^T x_i, \alpha_i)$
4. Update $w \leftarrow w + \delta^* x_i$

Implemented in LIBLINEAR: linear SVM (Hsieh et al., 2008), multi-class SVM (Keerthi et al., 2008), logistic regression (Yu et al., 2011).

Analysis: (Nesterov et al., 2012; Shalev-Shwartz et al., 2013)

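The four steps can be sketched as a serial loop. This is a minimal illustration, not LIBLINEAR's implementation. It covers only the hinge loss, for which T_i has the closed form clip(α_i - (w^T x_i - 1)/‖x_i‖², 0, C) - α_i. The function name `dcd_hinge`, the dense list-of-lists data layout, and the fixed epoch count are all assumptions, and labels are assumed to be folded into the rows of X.

```python
import random

def dcd_hinge(X, C=1.0, epochs=20, seed=0):
    """Serial stochastic DCD for the hinge-loss dual (illustrative sketch)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d                       # maintained so that w = sum_i alpha_i x_i
    sq_norm = [sum(v * v for v in x) for x in X]
    for _ in range(epochs * n):
        i = rng.randrange(n)            # step 1: pick an index at random
        if sq_norm[i] == 0.0:
            continue                    # an all-zero example never changes w
        wx = sum(wj * xj for wj, xj in zip(w, X[i]))          # step 2: w^T x_i
        # step 3: T_i for the hinge loss, i.e. a clipped Newton step on alpha_i
        new_alpha = min(max(alpha[i] - (wx - 1.0) / sq_norm[i], 0.0), C)
        delta = new_alpha - alpha[i]
        alpha[i] = new_alpha
        for j, xj in enumerate(X[i]):   # step 4: w <- w + delta* x_i
            w[j] += delta * xj
    return w, alpha
```

With sparse examples, steps 2 and 4 would loop only over the nonzeros of x_i, which is where the O(n_i) cost per update comes from.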
Dual Coordinate Descent

[Animated figure: memory holds the sparse examples x_1, ..., x_6, the primal vector w, and α = (0, 0, 0, 0, 0, 0). Two CPUs (CPU1, CPU2) are shown; the step traced is for i = 3 and uses registers R1 to R4, all initially 0. The nonzeros of x_3 are x_32 = 0.2 and x_34 = 0.4, and w_2 = w_4 = 0 at the start.]

DCD step: compute $w^T x_i$
- Load x_32 to R2 (R2 = 0.2)
- Load w_2 to R1 (R1 = 0)
- R3 = R3 + R1 × R2
- Load x_34 to R2 (R2 = 0.4)
- Load w_4 to R1 (R1 = 0)
- R3 = R3 + R1 × R2 (R3 now holds $w^T x_i$ = 0)

DCD step: compute $\delta^* = T_i(w^T x_i, \alpha_i)$
- Load α_3 to R2 (R2 = 0)
- R4 = T_i(R2, R3) (R4 = δ* = 1)

DCD step: update $\alpha_i = \alpha_i + \delta^*$
- R2 = R2 + R4 (R2 = 1)
- Save R2 to α_3 (α = (0, 0, 1, 0, 0, 0))

DCD step: update $w = w + \delta^* x_i$
- Load x_32 to R2 (R2 = 0.2)
- Load w_2 to R1 (R1 = 0)
- R1 = R1 + R2 × R4 (R1 = 0.2)
- Save R1 to w_2 (w_2 = 0.2)
- Load x_34 to R2 (R2 = 0.4)
- Load w_4 to R1 (R1 = 0)

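The register-level trace corresponds to a few lines of high-level code. The sketch below replays it for i = 3 with x_32 = 0.2 and x_34 = 0.4. The slides show δ* = 1 without stating which loss produces it, so T_i is passed in as a stub that returns 1.0. The sparse dict representation and the name `dcd_step` are assumptions made for illustration.

```python
def dcd_step(x_i, w, alpha, i, T_i):
    """One DCD step on a sparse example (dict: feature index -> value).

    Mutates w and alpha in place and returns (w^T x_i, delta*).
    """
    # Compute w^T x_i: load x_ij, load w_j, multiply-add, once per nonzero.
    wx = 0.0
    for j, x_ij in x_i.items():
        wx += w.get(j, 0.0) * x_ij
    # Compute delta* = T_i(w^T x_i, alpha_i).
    delta = T_i(wx, alpha[i])
    # Update alpha_i = alpha_i + delta*.
    alpha[i] += delta
    # Update w = w + delta* x_i: load, multiply-add, store, once per nonzero.
    for j, x_ij in x_i.items():
        w[j] = w.get(j, 0.0) + delta * x_ij
    return wx, delta

x_3 = {2: 0.2, 4: 0.4}        # nonzeros of x_3 from the figure
w = {}                        # missing entries are read as 0
alpha = {3: 0.0}
wx, delta = dcd_step(x_3, w, alpha, 3, lambda wx, a: 1.0)  # stub: delta* = 1
```

After the call, w_2 is 0.2 as in the trace. Each `w[j] = ...` line is a separate load, add, and store, which is the granularity at which the trace shows the update.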