Distributed Training for Large-scale Logistic Models
Siddharth Gopal, Carnegie Mellon University
Joint work with Yiming Yang. Presented at ICML 2013, 21 Aug 2013.
Outline of the Talk
Logistic Models
Maximum Likelihood Estimation
Parallelization
Experiments
Logistic Models
Logistic models describe the probability of an outcome Y given a predictor x:

P(Y = y \mid x; w) \propto \exp(w^\top \phi(y, x))

This subsumes multinomial logistic regression, conditional random fields, and maximum entropy models. For example, in multinomial logistic regression,

P(Y = k \mid x; w) = \frac{\exp(w_k^\top x)}{\sum_j \exp(w_j^\top x)}
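As a quick illustration of the multinomial case above, here is a minimal numpy sketch of the class probabilities; the function and variable names (multinomial_logistic_prob, W, x) are illustrative, not from the talk.

```python
import numpy as np

def multinomial_logistic_prob(W, x):
    """P(Y = k | x; w) for every class k under multinomial logistic regression.

    W : (K, D) array with one weight vector w_k per class; x : (D,) feature vector.
    """
    scores = W @ x                 # w_k^T x for each class k
    scores -= scores.max()         # shift for numerical stability; the ratio is unchanged
    expd = np.exp(scores)
    return expd / expd.sum()       # normalize so the K probabilities sum to 1

# Tiny usage example with made-up numbers (K = 3 classes, D = 2 features)
W = np.array([[0.5, -1.0], [0.2, 0.3], [-0.4, 0.8]])
x = np.array([1.0, 2.0])
print(multinomial_logistic_prob(W, x))
```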
Focus of the Talk
Train logistic models on large-scale data. What is large-scale?
Large number of training examples
High dimensionality
Large number of outcomes
Motivation
Some commonly used datasets on the web:

Dataset            #Instances   #Labels   #Features   #Parameters
ODP subset         93,805       12,294    347,256     4,269,165,264
Wikipedia subset   2,365,436    325,056   1,617,899   525,907,777,344
Image-net          14,197,122   21,841    -           -

How can we parallelize the training of such models? How can we optimize different subsets of parameters simultaneously?
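For the multinomial model there is one weight per (label, feature) pair, so the parameter counts above are simply #Labels x #Features; a quick check (Image-net has no feature count listed, so it is skipped):

```python
# #Parameters = #Labels * #Features (one weight per class per feature)
print(12_294 * 347_256)        # ODP subset       -> 4,269,165,264
print(325_056 * 1_617_899)     # Wikipedia subset -> 525,907,777,344
```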
Maximum Likelihood Estimation (MLE)
Typical MLE setup: N training examples, K classes. x_i denotes the i-th training example, and the indicator variable y_{ik} denotes whether x_i belongs to class k. Estimate the parameters w by maximizing the regularized log-likelihood,

\max_w \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log P(y_{ik} \mid x_i; w) - \frac{\lambda}{2} \|w\|^2

or, equivalently, by minimizing

[OPT1]   \min_w \frac{\lambda}{2} \|w\|^2 - \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} w_k^\top x_i + \sum_{i=1}^{N} \log\left( \sum_{k=1}^{K} \exp(w_k^\top x_i) \right)
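A minimal numpy sketch of the [OPT1] objective as written above; the function name and argument layout are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def opt1_objective(W, X, Y, lam):
    """Regularized negative log-likelihood [OPT1] for multinomial logistic regression.

    W : (K, D) class weight vectors, X : (N, D) examples,
    Y : (N, K) 0/1 indicators y_ik, lam : regularization strength lambda.
    """
    S = X @ W.T                            # S[i, k] = w_k^T x_i
    reg = 0.5 * lam * np.sum(W ** 2)       # (lambda / 2) ||w||^2
    linear = np.sum(Y * S)                 # sum_i sum_k y_ik w_k^T x_i
    lse = np.sum(logsumexp(S, axis=1))     # sum_i log sum_k exp(w_k^T x_i)
    return reg - linear + lse
```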
Parallelization

\min_w \frac{\lambda}{2} \|w\|^2 - \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} w_k^\top x_i + \sum_{i=1}^{N} \log\left( \sum_{k=1}^{K} \exp(w_k^\top x_i) \right)

The log-sum-exp (LSE) function couples all the class-level parameters w_k together. The idea is to replace the LSE by a parallelizable function. This parallelizable function should be an upper bound, and it should not make the optimization harder, e.g. by introducing non-convexity or non-differentiability.
Bound 1 - Piecewise Linear Bound (Hsiung et al.)
Properties used: the LSE is a convex function, and a convex function can be approximated to any precision by piecewise linear functions.

\max_j \{ a_j^\top \gamma + b_j \} \le \log\left( \sum_{k=1}^{K} \exp(\gamma_k) \right) \le \max_{j'} \{ c_{j'}^\top \gamma + d_{j'} \}, \qquad a_j, c_{j'} \in \mathbb{R}^K, \quad b_j, d_{j'} \in \mathbb{R}

[Figure: the LSE function sandwiched between its piecewise-linear lower and upper bounds.]
Bound 1 - Piecewise Linear Bound (Hsiung et al.)
Advantages: The bound can be made arbitrarily accurate by increasing the number of pieces.
Disadvantages: The max function makes the objective non-differentiable. The number of variational parameters grows with the approximation level. Optimizing the variational parameters is hard.
Bound 2 - Double Majorization (Bouchard 2007)
The LSE is bounded by

\log\left( \sum_{k=1}^{K} \exp(w_k^\top x_i) \right) \le a_i + \sum_{k=1}^{K} \log\left( 1 + \exp(w_k^\top x_i - a_i) \right), \qquad a_i \in \mathbb{R}

Advantages: The bound is parallelizable. It is an upper bound. It is differentiable and convex.
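A minimal numerical sanity check of the double-majorization bound; the helper name and the particular values of the variational parameter a_i are purely illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def bouchard_upper_bound(scores, a):
    """Double-majorization upper bound on log-sum-exp for one example.

    scores : (K,) array of w_k^T x_i; a : scalar variational parameter a_i.
    """
    return a + np.sum(np.log1p(np.exp(scores - a)))

rng = np.random.default_rng(0)
scores = rng.normal(size=10)
for a in (-1.0, 0.0, 2.0):                      # the bound holds for any real a_i
    assert logsumexp(scores) <= bouchard_upper_bound(scores, a) + 1e-12
```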
Bound 2 - Double Majorization (Bouchard 2007)
Disadvantage: The bound is not tight enough.

[Figure: function value of the true log-sum-exp objective vs. the upper-bounded objective across iterations, i.e. the gap between the true objective and the upper-bounded objective on the 20-newsgroups dataset.]
Bound 3 - Log Concavity
A relatively famous bound using the concavity of the log function:

\log(x) \le a x - \log(a) - 1 \qquad \forall x, a > 0

[Figure: log(x) together with the linear bounds ax - log(a) - 1 for a = 0.3, a = 2, and a = 0.02.]
Bound 3 - Log Concavity
Applying it to the LSE function,

\log\left( \sum_{k=1}^{K} \exp(w_k^\top x_i) \right) \le a_i \sum_{k=1}^{K} \exp(w_k^\top x_i) - \log(a_i) - 1

Advantages: The bound is parallelizable. It is differentiable. Optimizing the variational parameter a_i is easy: the bound is exact at a_i = 1 / \sum_{k=1}^{K} \exp(w_k^\top x_i).

Disadvantage: The combined objective is non-convex.
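A minimal numerical check of the log-concavity bound and of its tightness at the optimal variational parameter; names and values are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def log_concavity_bound(scores, a):
    """Upper bound: log(sum_k exp(s_k)) <= a * sum_k exp(s_k) - log(a) - 1 for a > 0."""
    return a * np.sum(np.exp(scores)) - np.log(a) - 1.0

rng = np.random.default_rng(1)
scores = rng.normal(size=8)
lse = logsumexp(scores)

for a in (0.01, 0.1, 1.0):                       # the bound holds for any a > 0
    assert lse <= log_concavity_bound(scores, a) + 1e-12

a_opt = 1.0 / np.sum(np.exp(scores))             # a_i = 1 / sum_k exp(w_k^T x_i)
assert np.isclose(lse, log_concavity_bound(scores, a_opt))   # exact at the optimum
```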
Reaching Optimality
MLE estimation:

\min_w \frac{\lambda}{2} \|w\|^2 - \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} w_k^\top x_i + \sum_{i=1}^{N} \log\left( \sum_{k=1}^{K} \exp(w_k^\top x_i) \right)

Log-concavity bound:

\log\left( \sum_{k=1}^{K} \exp(w_k^\top x_i) \right) \le a_i \sum_{k=1}^{K} \exp(w_k^\top x_i) - \log(a_i) - 1

Combined objective:

F(W, A) = \sum_{k=1}^{K} \frac{\lambda}{2} \|w_k\|^2 + \sum_{i=1}^{N} \left( -\sum_{k=1}^{K} y_{ik} w_k^\top x_i + a_i \sum_{k=1}^{K} \exp(w_k^\top x_i) - \log(a_i) - 1 \right)

Despite the non-convexity, we can show that the combined objective has a unique minimum, and that this minimum coincides with the optimal MLE solution.
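A minimal numpy sketch of the combined objective F(W, A) as written above; argument names and layout are illustrative.

```python
import numpy as np

def combined_objective(W, A, X, Y, lam):
    """Combined objective F(W, A): LSE replaced by the log-concavity bound.

    W : (K, D) class weights, A : (N,) positive variational parameters a_i,
    X : (N, D) examples, Y : (N, K) 0/1 indicators, lam : lambda.
    """
    S = X @ W.T                                        # S[i, k] = w_k^T x_i
    reg = 0.5 * lam * np.sum(W ** 2)                   # sum_k (lambda / 2) ||w_k||^2
    linear = np.sum(Y * S)                             # sum_i sum_k y_ik w_k^T x_i
    bound = np.sum(A * np.exp(S).sum(axis=1) - np.log(A) - 1.0)
    return reg - linear + bound
```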
Reaching Optimality
An iterative, parallel block coordinate descent algorithm converges to the unique minimum.

Algorithm 1: Parallel block coordinate descent
Initialize: t ← 0, A^0 ← 1/K, W^0 ← 0
While not converged:
    In parallel: W^{t+1} ← argmin_W F(W, A^t)
    A^{t+1} ← argmin_A F(W^{t+1}, A)
    t ← t + 1
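A single-machine sketch of Algorithm 1 in numpy, not the paper's implementation: a single gradient step per class stands in for the inner arg min over W, a serial loop over classes stands in for the parallel update, and the function names, learning rate, and iteration count are illustrative. The A-step uses the closed form a_i = 1 / \sum_k \exp(w_k^\top x_i), which is where the log-concavity bound is tight.

```python
import numpy as np

def w_step_for_class(w_k, X, y_k, A, lam, lr=0.1):
    """One gradient step on the per-class piece of F(W, A) for fixed A.

    With A fixed, F decouples over classes, so each w_k can be updated independently
    (and hence in parallel across classes or machines).
    """
    s = X @ w_k                                        # w_k^T x_i for all examples i
    grad = lam * w_k - X.T @ y_k + X.T @ (A * np.exp(s))
    return w_k - lr * grad

def block_coordinate_descent(X, Y, lam, iters=50):
    """Sketch of Algorithm 1: alternate a parallelizable W-update with a closed-form A-update."""
    N, D = X.shape
    K = Y.shape[1]
    W = np.zeros((K, D))                               # W^0 <- 0
    A = np.full(N, 1.0 / K)                            # A^0 <- 1/K
    for _ in range(iters):
        # W-step: each class k is an independent subproblem (serial loop stands in for parallelism).
        for k in range(K):
            W[k] = w_step_for_class(W[k], X, Y[:, k], A, lam)
        # A-step: argmin_a (a * S_i - log a - 1) gives a_i = 1 / sum_k exp(w_k^T x_i).
        A = 1.0 / np.exp(X @ W.T).sum(axis=1)
    return W, A
```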
Experimental Comparison

Datasets:
Dataset        #Instances   #Leaf-labels   #Features   #Parameters     Parameter size (approx.)
CLEF           10,000       63             80          5,040           40 KB
NEWS20         11,260       20             53,975      1,079,500       4 MB
LSHTC-small    4,463        1,139          51,033      227,760,279     911 MB
LSHTC-large    93,805       12,294         347,256     4,269,165,264   17 GB

Optimization methods:
Double Majorization Bound (DM)
Log-concavity Bound (LC)
Limited-memory BFGS (LBFGS) - the most widely used method
Alternating Direction Method of Multipliers (ADMM)
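The sizes in the last column are roughly what one gets from storing one 4-byte float per parameter (an assumption, not stated on the slide; the CLEF figure looks closer to 8-byte doubles):

```python
# Approximate storage assuming one 4-byte float per parameter (assumed, not from the slide).
for name, n_params in [("NEWS20", 1_079_500), ("LSHTC-small", 227_760_279),
                       ("LSHTC-large", 4_269_165_264)]:
    print(f"{name}: {n_params * 4 / 1e6:,.0f} MB")   # ~4 MB, ~911 MB, ~17,077 MB (17 GB)
```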