Large Scale Machine Learning with Stochastic Gradient Descent

Léon Bottou
leon@bottou.org
Microsoft (since June)
Summary

i.   Learning with Stochastic Gradient Descent.
ii.  The Tradeoffs of Large Scale Learning.
iii. Asymptotic Analysis.
iv.  Learning with a Single Pass.
I. Learning with Stochastic Gradient Descent
Example

Binary classification
– Patterns x.
– Classes y = ±1.

Linear model
– Choose features: Φ(x) ∈ R^d.
– Linear discriminant function: f_w(x) = sign( w⊤Φ(x) ).
SVM training

– Choose a loss function:  Q(x, y, w) = ℓ( y, f_w(x) ) = (e.g.)  log( 1 + e^{−y w⊤Φ(x)} ).
– Cannot minimize the expected risk  E(w) = ∫ Q(x, y, w) dP(x, y).
– Can compute the empirical risk  E_n(w) = (1/n) ∑_{i=1}^n Q(x_i, y_i, w).

Minimize the L2-regularized empirical risk

    min_w  (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n Q(x_i, y_i, w)

Choosing λ is the same as setting a constraint ‖w‖² < B.
Batch versus Online

Batch: process all examples together (GD)
– Example: minimization by gradient descent

    Repeat:  w ← w − γ ( λw + (1/n) ∑_{i=1}^n ∂Q/∂w (x_i, y_i, w) )

Online: process examples one by one (SGD)
– Example: minimization by stochastic gradient descent (see the sketch after this slide)

    Repeat: (a) Pick a random example (x_t, y_t)
            (b) w ← w − γ_t ( λw + ∂Q/∂w (x_t, y_t, w) )
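The SGD update (b) is simple enough to write out. The following Python fragment is only a minimal sketch of that update for the regularized logistic loss, not the implementation behind the experiments later in the deck; the dense NumPy data layout and the decreasing schedule γ_t = γ0 / (1 + γ0 λ t) are assumptions chosen for the example.

```python
# Minimal sketch of SGD for the L2-regularized logistic loss
#   Q(x, y, w) = log(1 + exp(-y w·Φ(x))),  objective (λ/2)‖w‖² + (1/n) Σ Q.
# Dense features and the schedule γ_t = γ0/(1 + γ0 λ t) are assumptions.
import numpy as np

def sgd(features, labels, lam=1e-4, gamma0=0.1, epochs=1):
    n, d = features.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):          # (a) pick a random example
            t += 1
            gamma = gamma0 / (1.0 + gamma0 * lam * t)
            margin = labels[i] * features[i].dot(w)
            # dQ/dw = -y Φ(x) / (1 + exp(y w·Φ(x)))
            dloss = -labels[i] / (1.0 + np.exp(margin)) * features[i]
            w -= gamma * (lam * w + dloss)           # (b) stochastic gradient step
    return w
```

With labels in {−1, +1} this reproduces update (b) verbatim; only the learning-rate schedule is an added choice.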
Second order optimization

Batch: (2GD)
– Example: Newton's algorithm

    Repeat:  w ← w − H⁻¹ ( λw + (1/n) ∑_{i=1}^n ∂Q/∂w (x_i, y_i, w) )

Online: (2SGD)
– Example: second order stochastic gradient descent (a hedged sketch of one possible H⁻¹ follows)

    Repeat: (a) Pick a random example (x_t, y_t)
            (b) w ← w − γ_t H⁻¹ ( λw + ∂Q/∂w (x_t, y_t, w) )
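The slides leave the choice of H⁻¹ open. One common practical shortcut, shown below purely as an illustration and not as the method used in this deck, is to keep a running diagonal estimate of the Hessian of the regularized logistic objective and rescale the SGD step with it.

```python
# Hypothetical sketch of 2SGD with a diagonal Hessian approximation for the
# L2-regularized logistic loss. The diagonal estimate, the 1/t rate, and the
# eps floor are assumptions made for this example, not part of the slides.
import numpy as np

def sgd2_diag(features, labels, lam=1e-4, eps=1e-8, epochs=1):
    n, d = features.shape
    w = np.zeros(d)
    h = np.full(d, lam)                 # running estimate of diag(H)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            margin = labels[i] * features[i].dot(w)
            sigma = 1.0 / (1.0 + np.exp(margin))
            grad = lam * w - sigma * labels[i] * features[i]
            # per-coordinate curvature: λ from the regularizer + σ(1−σ) Φ_j(x)²
            sample_diag = lam + sigma * (1.0 - sigma) * features[i] ** 2
            h += (sample_diag - h) / t           # running average of the diagonal
            w -= (1.0 / t) * grad / (h + eps)    # step (b) with H⁻¹ ≈ diag(h)⁻¹
    return w
```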
More SGD Algorithms

Adaline (Widrow and Hoff, 1960)
  Q_adaline = ½ ( y − w⊤Φ(x) )²,   Φ(x) ∈ R^d, y = ±1
  w ← w + γ_t ( y_t − w⊤Φ(x_t) ) Φ(x_t)

Perceptron (Rosenblatt, 1957)
  Q_perceptron = max{ 0, −y w⊤Φ(x) },   Φ(x) ∈ R^d, y = ±1
  w ← w + γ_t { y_t Φ(x_t)  if y_t w⊤Φ(x_t) ≤ 0 ;  0 otherwise }

Multilayer perceptrons (Rumelhart et al., 1986) ...
SVM (Cortes and Vapnik, 1995) ...

Lasso (Tibshirani, 1996)
  Q_lasso = λ |w|₁ + ½ ( y − w⊤Φ(x) )²,   Φ(x) ∈ R^d, y ∈ R, λ > 0
  w = ( u₁ − v₁, ..., u_d − v_d )
  u_i ← [ u_i − γ_t ( λ − ( y_t − w⊤Φ(x_t) ) Φ_i(x_t) ) ]₊
  v_i ← [ v_i − γ_t ( λ + ( y_t − w⊤Φ(x_t) ) Φ_i(x_t) ) ]₊
  with notation [x]₊ = max{ 0, x }.

K-Means (MacQueen, 1967)
  Q_kmeans = min_k ½ ( z − w_k )²,   z ∈ R^d,  w₁ ... w_k ∈ R^d,  n₁ ... n_k ∈ N, initially 0
  k* = arg min_k ( z_t − w_k )²
  n_{k*} ← n_{k*} + 1
  w_{k*} ← w_{k*} + (1/n_{k*}) ( z_t − w_{k*} )
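Of the updates above, MacQueen's K-means rule is perhaps the least familiar as a stochastic gradient method; the sketch below spells it out. Initializing the centers from k random data points and making a single pass over a NumPy array are assumptions made for the example.

```python
# Sketch of MacQueen's online K-means: each point moves its nearest center
# by a step 1/n_{k*}, where n_{k*} counts points assigned to that center so far.
import numpy as np

def online_kmeans(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # assumption: initialize centers from k distinct random data points
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    counts = np.zeros(k, dtype=int)
    for z in points:
        k_star = np.argmin(((centers - z) ** 2).sum(axis=1))   # nearest center
        counts[k_star] += 1
        centers[k_star] += (z - centers[k_star]) / counts[k_star]
    return centers, counts
```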
II. The Tradeoffs of Large Scale Learning
The Computational Problem

• Baseline large-scale learning algorithm
  Randomly discarding data is the simplest way to handle large datasets.
  – What is the statistical benefit of processing more data?
  – What is the computational cost of processing more data?

• We need a theory that links Statistics and Computation!
  – 1967: Vapnik's theory does not discuss computation.
  – 1984: Valiant's learnability excludes exponential time algorithms,
    but (i) polynomial time is already too slow, (ii) few actual results.
Decomposition of the Error

    E(f̃_n) − E(f*) =  E(f*_F) − E(f*)     Approximation error (E_app)
                     + E(f_n) − E(f*_F)     Estimation error (E_est)
                     + E(f̃_n) − E(f_n)     Optimization error (E_opt)

Problem: choose F, n, and ρ to make this as small as possible, subject to budget constraints
  – max number of examples n
  – max computing time T

Note: choosing λ is the same as choosing F.
Small-scale Learning

"The active budget constraint is the number of examples."

• To reduce the estimation error, take n as large as the budget allows.
• To reduce the optimization error to zero, take ρ = 0.
• We need to adjust the size of F.

(Figure: tradeoff between estimation error and approximation error as the size of F grows.)

See Structural Risk Minimization (Vapnik 74) and later works.
Large-scale Learning

"The active budget constraint is the computing time."

• More complicated tradeoffs: the computing time depends on all three variables F, n, and ρ.
• Example: if we choose ρ small, we decrease the optimization error, but we must also
  decrease F and/or n, with adverse effects on the estimation and approximation errors.
• The exact tradeoff depends on the optimization algorithm.
• We can compare optimization algorithms rigorously.
Test Error versus Learning Time

(Figure: test error as a function of computing time, bounded below by the Bayes limit.)
Test Error versus Learning Time

(Figure: test error versus computing time for 10,000, 100,000, and 1,000,000 examples, with the Bayes limit shown.)

Vary the number of examples...
Test Error versus Learning Time

(Figure: test error versus computing time for optimizers a, b, c; models I–IV; and 10,000, 100,000, 1,000,000 examples, with the Bayes limit shown.)

Vary the number of examples, the statistical models, the algorithms, ...
Test Error versus Learning Time

(Figure: same plot; the lower envelope of all combinations marks the good learning algorithms.)

Not all combinations are equal.
Let's compare the envelope (the red curve) for different optimization algorithms.
III. Asymptotic Analysis
Asymptotic Analysis

    E(f̃_n) − E(f*) = E = E_app + E_est + E_opt

Asymptotic analysis: all three errors must decrease at comparable rates.
Forcing one of the errors to decrease much faster
  – would require additional computing effort,
  – but would not significantly improve the test error.
Statistics

Asymptotics of the statistical components of the error
– Thanks to refined uniform convergence arguments:

    E = E_app + E_est + E_opt  ∼  E_app + ( log(n) / n )^α + ρ

  with exponent 1/2 ≤ α ≤ 1.

Asymptotically effective large scale learning
– Must choose F, n, and ρ such that

    E ∼ E_app ∼ E_est ∼ E_opt ∼ ( log(n) / n )^α ∼ ρ.

What about optimization times?
Statistics and Computation

                              GD                      2GD                                 SGD    2SGD
Time per iteration:           n                       n                                   1      1
Iterations to accuracy ρ:     log(1/ρ)                log log(1/ρ)                        1/ρ    1/ρ
Time to accuracy ρ:           n log(1/ρ)              n log log(1/ρ)                      1/ρ    1/ρ
Time to error E:              (1/E^{1/α}) log²(1/E)   (1/E^{1/α}) log(1/E) log log(1/E)   1/E    1/E

– 2GD optimizes much faster than GD.
– SGD optimization speed is catastrophic.
– SGD learns faster than both GD and 2GD.
– 2SGD only changes the constants.
Experiment: Text Categorization

Dataset
– Reuters RCV1 document corpus.
– 781,265 training examples, 23,149 testing examples.

Task
– Recognizing documents of category CCAT.
– 47,152 TF-IDF features.
– Linear SVM.
– Same setup as (Joachims, 2006) and (Shalev-Shwartz et al., 2007), using plain SGD.
Experiment: Text Categorization

• Results: Hinge-loss SVM
  Q(x, y, w) = max{ 0, 1 − y w⊤Φ(x) },  λ = 0.0001

                                Training Time    Primal cost    Test Error
  SVMLight                      23,642 secs      0.2275         6.02%
  SVMPerf                       66 secs          0.2278         6.03%
  SGD                           1.4 secs         0.2275         6.02%

• Results: Log-loss SVM
  Q(x, y, w) = log( 1 + exp( −y w⊤Φ(x) ) ),  λ = 0.00001

                                Training Time    Primal cost    Test Error
  TRON (LibLinear, ε = 0.01)    30 secs          0.18907        5.68%
  TRON (LibLinear, ε = 0.001)   44 secs          0.18890        5.70%
  SGD                           2.3 secs         0.18893        5.66%
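For reference, a hinge-loss SGD step of the kind timed above fits in a few lines. This is only an illustrative sketch, not the code behind the table: the schedule γ_t = 1/(λ(t + t0)) and the value of t0 are assumptions, and a production implementation would exploit the sparsity of the TF-IDF features.

```python
# Sketch of plain SGD for the L2-regularized hinge loss (linear SVM),
#   Q(x, y, w) = max{0, 1 - y w·Φ(x)},  objective (λ/2)‖w‖² + (1/n) Σ Q.
# The schedule γ_t = 1/(λ(t + t0)) and t0 are assumptions for this sketch.
import numpy as np

def svm_sgd(features, labels, lam=1e-4, t0=1e4, epochs=1):
    n, d = features.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            gamma = 1.0 / (lam * (t + t0))
            margin = labels[i] * features[i].dot(w)
            w *= 1.0 - gamma * lam                    # shrinkage from the λw term
            if margin < 1.0:                          # hinge is active inside the margin
                w += gamma * labels[i] * features[i]  # subgradient step on the hinge term
    return w
```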
The Wall

(Figure: testing cost and training time (secs) for SGD and TRON (LibLinear), plotted against the optimization accuracy trainingCost − optimalTrainingCost.)
IV. Learning with a Single Pass
Batch and online paths

(Figure: ONLINE, one pass over examples {z1...zt}, follows a path w_1, ..., w_t toward the true solution w*, which gives the best generalization; BATCH, many iterations on examples {z1...zt}, converges to w*_t, which gives the best training set error.)
Effect of one Additional Example (i)

Compare

    w*_n     = arg min_w  E_n(f_w)
    w*_{n+1} = arg min_w  E_{n+1}(f_w) = arg min_w [ E_n(f_w) + (1/n) ℓ( f_w(x_{n+1}), y_{n+1} ) ]

(Figure: the empirical risks E_n(f_w) and E_{n+1}(f_w) and their nearby minima w*_n and w*_{n+1}.)
Effect of one Additional Example (ii)

• First order calculation

    w*_{n+1} = w*_n − (1/n) H_{n+1}⁻¹ ∂ℓ( f_{w*_n}(x_{n+1}), y_{n+1} ) / ∂w + O(1/n²)

  where H_{n+1} is the empirical Hessian on n + 1 examples.

• Compare with second order stochastic gradient descent:

    w_{t+1} = w_t − (1/t) H⁻¹ ∂ℓ( f_{w_t}(x_t), y_t ) / ∂w

• Could they converge with the same speed?
• C² assumptions ⟹ accurate speed estimates.
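The first order calculation can be justified in two lines. The derivation below is a sketch added for completeness, not part of the original slides; it assumes E_{n+1} is twice differentiable and that H_{n+1} = ∇²E_{n+1}(w*_n) is invertible.

```latex
% Expand the optimality condition of E_{n+1} around w^*_n, using \nabla E_n(w^*_n) = 0
% and (n+1) E_{n+1} = n E_n + \ell_{n+1}.
\begin{aligned}
0 &= \nabla E_{n+1}(w^*_{n+1})
   \;\approx\; \nabla E_{n+1}(w^*_n) + H_{n+1}\,(w^*_{n+1} - w^*_n), \\
\nabla E_{n+1}(w^*_n)
  &= \tfrac{n}{n+1}\,\underbrace{\nabla E_n(w^*_n)}_{=\,0}
   + \tfrac{1}{n+1}\,\frac{\partial \ell\big(f_{w^*_n}(x_{n+1}),\, y_{n+1}\big)}{\partial w}, \\
\Rightarrow\quad
w^*_{n+1} &\approx w^*_n
   - \tfrac{1}{n+1}\, H_{n+1}^{-1}\,
     \frac{\partial \ell\big(f_{w^*_n}(x_{n+1}),\, y_{n+1}\big)}{\partial w},
\end{aligned}
% which matches the slide's formula: the difference between 1/(n+1) and 1/n is O(1/n^2).
```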