
Large-Scale Machine Learning
Jean-Philippe Vert, jean-philippe.vert@{mines-paristech,curie,ens}.fr

Outline: 1 Introduction; 2 Standard machine learning (dimension reduction: PCA; clustering: k-means; regression: ridge regression; classification: kNN, logistic regression and SVM; nonlinear models: kernel methods); 3 Large-scale machine learning; 4 Conclusion


  1. k-means example: Iris dataset. [Scatter plot of the iris samples projected on the first two principal components, PC1 vs PC2.]
      > irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
      > table(irisCluster$cluster, iris$Species)
          setosa versicolor virginica
        1      0         48         4
        2     50          0         0
        3      0          2        46

  2. k-means example: Iris k-means, k = 2. [Same PCA projection, with points colored by the two clusters found.]

  3. k-means example: Iris k-means, k = 3. [PCA projection with points colored by the three clusters; this k = 3 clustering is the one produced by the kmeans call and confusion table above.]

  4. k-means example: Iris k-means, k = 4. [PCA projection with points colored by the four clusters found.]

  5. k-means example: Iris k-means, k = 5. [PCA projection with points colored by the five clusters found.]

  6. k-means complexity. Each update step: O(nd). Each assignment step: O(ndk).
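
A minimal R sketch of one Lloyd iteration, making the two costs visible (assuming the usual setup: X is an n × d data matrix and centers a k × d matrix of current centroids; the function name is illustrative and empty clusters are not handled):

    kmeans_step <- function(X, centers) {
      k <- nrow(centers)
      # Assignment step, O(ndk): squared distance from every point to every centroid
      d2 <- sapply(1:k, function(j) rowSums(sweep(X, 2, centers[j, ])^2))
      cl <- max.col(-d2)                        # index of the closest centroid per point
      # Update step, O(nd): each centroid becomes the mean of its assigned points
      new_centers <- t(sapply(1:k, function(j)
        colMeans(X[cl == j, , drop = FALSE])))
      list(cluster = cl, centers = new_centers)
    }

Iterating this until the assignments stop changing is, up to initialization and restarts, what the kmeans() call used in the Iris example does internally.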

  7. Outline. 1 Introduction; 2 Standard machine learning (dimension reduction: PCA; clustering: k-means; regression: ridge regression; classification: kNN, logistic regression and SVM; nonlinear models: kernel methods); 3 Large-scale machine learning; 4 Conclusion.

  8. Motivation. [Scatter plot of a continuous output y against an input x.] Predict a continuous output from an input.

  9. Motivation (continued). [Same scatter plot.] Predict a continuous output from an input.

  10. Model. Training set S = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R^d × R. Fit a linear function f_β(x) = β^⊤x. Goodness of fit is measured by the residual sum of squares RSS(β) = Σ_{i=1}^n (y_i − f_β(x_i))². Ridge regression minimizes the regularized RSS: min_β RSS(β) + λ Σ_{i=1}^d β_i². Solution (set the gradient to 0): β̂ = (X^⊤X + λI)^{−1} X^⊤Y.
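
A short R sketch of the closed-form solution on simulated data (names and data are illustrative; solve() is used on the linear system rather than forming the inverse explicitly):

    # Ridge regression: beta_hat = (X'X + lambda I)^{-1} X'Y
    ridge_fit <- function(X, Y, lambda) {
      d <- ncol(X)
      solve(t(X) %*% X + lambda * diag(d), t(X) %*% Y)
    }

    set.seed(1)
    n <- 200; d <- 10
    X <- matrix(rnorm(n * d), n, d)
    beta_true <- rnorm(d)
    Y <- X %*% beta_true + rnorm(n)
    beta_hat <- ridge_fit(X, Y, lambda = 0.1)   # close to beta_true for small lambda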

  11. Ridge regression complexity. Compute X^⊤X: O(nd²). Invert (X^⊤X + λI): O(d³). When n > d, computing X^⊤X is more expensive than inverting it!

  12. Outline (same as slide 7).

  13. Motivation. Predict the category of a data point; there may be 2 or more (sometimes many) categories.

  14. Motivation (continued).

  15. Motivation (continued).

  16. Motivation (continued).

  17. k-nearest neighbors (kNN). [Two-class scatter plot with a kNN decision boundary, from Hastie et al., The Elements of Statistical Learning, Springer, 2001.] Training set S = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R^d × {−1, 1}. No training. Given a new point x ∈ R^d, predict the majority class among its k nearest neighbors (take k odd).
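
A minimal R sketch of the prediction rule for a single test point (names are illustrative); the full scan over the training set is exactly where the O(nd) prediction cost on the next slide comes from:

    # Majority vote among the k nearest training points (Euclidean distance).
    # X: n x d training inputs, y: labels in {-1, +1}, x: one test point, k odd.
    knn_predict <- function(X, y, x, k = 5) {
      d2 <- rowSums(sweep(X, 2, x)^2)   # O(nd): squared distance to every training point
      nn <- order(d2)[1:k]              # indices of the k nearest neighbors
      sign(sum(y[nn]))                  # majority class (no ties when k is odd)
    }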

  18. kNN properties. Universal (Bayes) consistency [Stone, 1977]: take k = √n (for example); let P be any distribution over (X, Y) pairs; assume the training data are random pairs sampled i.i.d. according to P. Then the kNN classifier f̂_n satisfies, almost surely, lim_{n→+∞} P(f̂_n(X) ≠ Y) = inf_{f measurable} P(f(X) ≠ Y). Complexity: memory: storing X is O(nd); training time: 0; prediction: O(nd) for each test point.

  19. Linear models for classification. Training set S = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R^d × {−1, 1}. Fit a linear function f_β(x) = β^⊤x. The prediction on a new point x ∈ R^d is +1 if f_β(x) > 0, and −1 otherwise.

  20. Large-margin classifiers. For any f: R^d → R, the margin of f on an (x, y) pair is y f(x). Large-margin classifiers fit a classifier by maximizing the margins on the training set: min_β Σ_{i=1}^n ℓ(y_i f_β(x_i)) + λ β^⊤β, for a convex, non-increasing loss function ℓ: R → R_+.

  21. Loss function examples.
      Loss      Method                         ℓ(u)
      0-1       none                           1(u ≤ 0)
      Hinge     Support vector machine (SVM)   max(1 − u, 0)
      Logistic  Logistic regression            log(1 + e^{−u})
      Square    Ridge regression               (1 − u)²

  22. Ridge logistic regression [Le Cessie and van Houwelingen, 1992]. min_{β ∈ R^p} J(β) = Σ_{i=1}^n ln(1 + e^{−y_i β^⊤x_i}) + λ β^⊤β. Can be interpreted as a regularized conditional maximum-likelihood estimator. No explicit solution, but a smooth convex optimization problem that can be solved numerically by Newton-Raphson iterations: β^new ← β^old − (∇²_β J(β^old))^{−1} ∇_β J(β^old). Each iteration amounts to solving a weighted ridge regression problem, hence the name iteratively reweighted least squares (IRLS). Complexity: O(#iterations × (nd² + d³)).
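
A minimal R sketch of the Newton-Raphson / IRLS iterations for this objective (the gradient and Hessian below are the standard ones for the regularized logistic loss; names and the fixed iteration count are illustrative):

    # Ridge logistic regression by Newton-Raphson (IRLS); y has entries in {-1, +1}.
    irls_logistic <- function(X, y, lambda, n_iter = 20) {
      d <- ncol(X)
      beta <- rep(0, d)
      for (it in 1:n_iter) {
        m <- as.vector(X %*% beta) * y                   # margins y_i beta'x_i
        p <- 1 / (1 + exp(m))                            # sigma(-y_i beta'x_i)
        grad <- -t(X) %*% (y * p) + 2 * lambda * beta
        W <- p * (1 - p)                                 # Hessian weights
        H <- t(X) %*% (X * W) + 2 * lambda * diag(d)     # O(nd^2)
        beta <- beta - solve(H, grad)                    # O(d^3): one weighted ridge solve
      }
      as.vector(beta)
    }

Each iteration costs O(nd² + d³), which is where the overall complexity above comes from.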

  23. SVM [Boser et al., 1992]. min_{β ∈ R^p} Σ_{i=1}^n max(0, 1 − y_i β^⊤x_i) + λ β^⊤β. A non-smooth convex optimization problem (convex quadratic program). Equivalent to the dual problem: max_{α ∈ R^n} 2 α^⊤Y − α^⊤XX^⊤α, subject to 0 ≤ y_i α_i ≤ 1/(2λ) for i = 1, ..., n. The solution β* of the primal is obtained from the solution α* of the dual: β* = X^⊤α*, so f_{β*}(x) = (β*)^⊤x = (α*)^⊤Xx. Training complexity: O(n²) to store XX^⊤, O(n³) to find α*. Prediction: O(d) for (β*)^⊤x, O(nd) for (α*)^⊤Xx.
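
The primal/dual correspondence is easy to check numerically. In the small R sketch below, alpha is just a random stand-in for a dual solution α* (obtaining the actual α* requires a QP solver); the point is only the algebra β* = X^⊤α* and the two prediction costs:

    set.seed(1)
    n <- 100; d <- 5
    X <- matrix(rnorm(n * d), n, d)
    alpha <- runif(n)                       # stand-in for a dual solution alpha*
    X_new <- matrix(rnorm(10 * d), 10, d)   # 10 test points

    beta_star <- t(X) %*% alpha             # beta* = X' alpha*, O(nd)
    f_primal  <- X_new %*% beta_star        # (beta*)' x: O(d) per test point
    f_dual    <- (X_new %*% t(X)) %*% alpha # (alpha*)' X x: O(nd) per test point
    max(abs(f_primal - f_dual))             # ~ 0: both give the same predictions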

  24. Outline (same as slide 7).

  25. Motivation. [Scatter plot of y against x: the relationship between input and output is clearly nonlinear.]

  26. Model. Learn a function f: R^d → R of the form f(x) = Σ_{i=1}^n α_i K(x_i, x), for a positive definite (p.d.) kernel K: R^d × R^d → R, such as:
      Linear       K(x, x′) = x^⊤x′
      Polynomial   K(x, x′) = (x^⊤x′ + c)^p
      Gaussian     K(x, x′) = exp(−‖x − x′‖² / (2σ²))
      Min/max      K(x, x′) = Σ_{i=1}^d min(|x_i|, |x′_i|) / max(|x_i|, |x′_i|)
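
As a small illustration, an R sketch of the Gaussian-kernel Gram matrix on the rows of a data matrix (names are illustrative); its O(n²d) cost reappears in the complexity discussion later in this section:

    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), computed in O(n^2 d).
    gaussian_gram <- function(X, sigma = 1) {
      sq <- rowSums(X^2)
      D2 <- outer(sq, sq, "+") - 2 * X %*% t(X)   # squared pairwise distances
      exp(-D2 / (2 * sigma^2))
    }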

  27. Feature space. A function K: R^d × R^d → R is a p.d. kernel if and only if there exists a mapping Φ: R^d → R^D, for some D ∈ N ∪ {+∞}, such that K(x, x′) = Φ(x)^⊤Φ(x′) for all x, x′ ∈ R^d. f is then a linear function in R^D: f(x) = Σ_{i=1}^n α_i K(x_i, x) = Σ_{i=1}^n α_i Φ(x_i)^⊤Φ(x) = β^⊤Φ(x), with β = Σ_{i=1}^n α_i Φ(x_i). [Figure: a nonlinear feature map from (x_1, x_2) ∈ R² to a higher-dimensional feature space.]

  28. Learning. [Same feature-map figure.] We can learn f(x) = Σ_{i=1}^n α_i K(x_i, x) by fitting a linear model β^⊤Φ(x) in the feature space. Example (ridge regression / logistic regression / SVM): min_{β ∈ R^D} Σ_{i=1}^n ℓ(y_i, β^⊤Φ(x_i)) + λ β^⊤β. But D can be very large, even infinite...

  29. Kernel tricks. K(x, x′) = Φ(x)^⊤Φ(x′) can be quick to compute even if D is large (even infinite). For a set of training samples {x_1, ..., x_n} ⊂ R^d, let K_n be the n × n Gram matrix [K_n]_ij = K(x_i, x_j). For β = Σ_{i=1}^n α_i Φ(x_i) we have β^⊤Φ(x_i) = [Kα]_i and β^⊤β = α^⊤Kα. We can therefore solve the equivalent problem in α ∈ R^n: min_{α ∈ R^n} Σ_{i=1}^n ℓ(y_i, [Kα]_i) + λ α^⊤Kα.

  30. Example: kernel ridge regression (KRR). min_{β ∈ R^D} Σ_{i=1}^n (y_i − β^⊤Φ(x_i))² + λ β^⊤β. Solve in R^D: β̂ = (Φ(X)^⊤Φ(X) + λI)^{−1} Φ(X)^⊤Y (inverting a D × D matrix). Solve in R^n: α̂ = (K + λI)^{−1} Y (inverting an n × n matrix).
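
A minimal R sketch of KRR in the dual on simulated one-dimensional data (the data, bandwidth and λ are illustrative, not the dataset shown on the slides); the key line is the n × n solve:

    set.seed(2)
    n <- 100; sigma <- 1; lambda <- 0.1
    x <- sort(runif(n, 0, 10))
    y <- sin(x) + 0.2 * rnorm(n)                   # simulated 1-D regression data

    K <- exp(-outer(x, x, function(a, b) (a - b)^2) / (2 * sigma^2))
    alpha <- solve(K + lambda * diag(n), y)        # alpha = (K + lambda I)^{-1} Y

    x_new <- seq(0, 10, length.out = 200)
    K_new <- exp(-outer(x_new, x, function(a, b) (a - b)^2) / (2 * sigma^2))
    f_hat <- K_new %*% alpha                       # f(x) = sum_i alpha_i K(x_i, x)
    # plot(x, y); lines(x_new, f_hat)              # fits of this kind are shown next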

  31. KRR with Gaussian RBF kernel: min_β Σ_{i=1}^n (y_i − β^⊤Φ(x_i))² + λ β^⊤β, with K(x, x′) = exp(−‖x − x′‖² / (2σ²)). [Scatter plot of the data from slide 25.]

  32.-42. KRR with Gaussian RBF kernel (same objective). [One plot per slide showing the fitted curve on the same data as the regularization decreases: λ = 1000, 100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001.]

  43. Complexity. [Plot: the KRR fit with λ = 1 on the same data.] Compute K: O(dn²). Store K: O(n²). Solve for α: O(n²) to O(n³). Compute f(x) for one x: O(nd). Impractical for n larger than about 10-100k.

  44. Outline. 1 Introduction; 2 Standard machine learning (dimension reduction: PCA; clustering: k-means; regression: ridge regression; classification: kNN, logistic regression and SVM; nonlinear models: kernel methods); 3 Large-scale machine learning (scalability issues; the tradeoffs of large-scale learning; random projections; random features; approximate NN; shingling, hashing, sketching); 4 Conclusion.

  45. Outline. 1 Introduction; 2 Standard machine learning; 3 Large-scale machine learning (scalability issues; the tradeoffs of large-scale learning; random projections; random features; approximate NN; shingling, hashing, sketching); 4 Conclusion.

  46. What is “large-scale”? Data cannot fit in RAM. The algorithm cannot run on a single machine in reasonable time (algorithm-dependent). Sometimes even O(n) is too large (e.g., nearest-neighbor search in a database of O(B+) items). Many tasks / parameters (e.g., image categorization with O(10M) classes). Streams of data.

  47. Things to worry about: training time (usually offline), memory requirements, test time. Complexities so far:
      Method                 Memory   Training time   Test time
      PCA                    O(d²)    O(nd²)          O(d)
      k-means                O(nd)    O(ndk)          O(kd)
      Ridge regression       O(d²)    O(nd²)          O(d)
      kNN                    O(nd)    0               O(nd)
      Logistic regression    O(nd)    O(nd²)          O(d)
      SVM, kernel methods    O(n²)    O(n³)           O(nd)

  48. Techniques for large-scale machine learning. Good baselines: subsample the data and run a standard method; split the data and run on several machines (depends on the algorithm). Beyond that: revisit standard algorithms and implementations with scalability in mind; trade exactness for scalability; compress, sketch, or hash the data in a smart way.

  49. Outline (same as slide 45).

  50. Motivation. Classical learning theory analyzes the trade-off between the approximation error (how well the function class F can approximate the true function) and the estimation error (how well we estimate the parameters from finite data). But reaching the best trade-off for a given n may be impossible with limited computational resources. We should include the computational budget in the trade-off, and ask which optimization algorithm gives the best overall trade-off. Seminal paper of Bottou and Bousquet [2008].

  51. Classical ERM setting. Goal: learn a function f: R^d → Y (Y = R or {−1, 1}). P is an unknown distribution over R^d × Y. Training set: S = {(X_1, Y_1), ..., (X_n, Y_n)} ⊂ R^d × Y, sampled i.i.d. from P. Fix a class of functions F ⊂ {f: R^d → R} and choose a loss ℓ(y, f(x)). Learning by empirical risk minimization: f_n ∈ argmin_{f ∈ F} R_n[f] = (1/n) Σ_{i=1}^n ℓ(Y_i, f(X_i)). Hope that f_n has a small risk R[f_n] = E ℓ(Y, f_n(X)).

  52. Classical ERM setting (continued). The best possible risk is R* = min_{f: R^d → Y} R[f]. The best achievable risk over F is R*_F = min_{f ∈ F} R[f]. We then have the decomposition R[f_n] − R* = (R[f_n] − R*_F) + (R*_F − R*), where the first term is the estimation error ε_est and the second the approximation error ε_app.

  53. Optimization error. Solving the ERM problem exactly may be hard (when n and d are large). Instead we usually find an approximate solution f̃_n that satisfies R_n[f̃_n] ≤ R_n[f_n] + ρ. The excess risk of f̃_n is then ε = R[f̃_n] − R* = (R[f̃_n] − R[f_n]) + ε_est + ε_app, where the first term is the optimization error ε_opt.

  54. A new trade-off: ε = ε_app + ε_est + ε_opt. Problem: choose F, n, and ρ to make ε as small as possible, subject to a limit on n and on the computation time T. Typical variations when F, n, and ρ increase:
                                       F     n     ρ
      E_app (approximation error)      ↘
      E_est (estimation error)         ↗     ↘
      E_opt (optimization error)       ···   ···   ↗
      T (computation time)             ↗     ↗     ↘
      Large-scale or small-scale? Small-scale when the constraint on n is active; large-scale when the constraint on T is active.

  55. Comparing optimization methods. min_{β ∈ B ⊂ R^d} R_n[f_β] = Σ_{i=1}^n ℓ(y_i, f_β(x_i)). Gradient descent (GD): β_{t+1} ← β_t − η ∂R_n(f_{β_t})/∂β. Second-order gradient descent (2GD), assuming the Hessian H is known: β_{t+1} ← β_t − H^{−1} ∂R_n(f_{β_t})/∂β. Stochastic gradient descent (SGD): β_{t+1} ← β_t − (η/t) ∂ℓ(y_t, f_{β_t}(x_t))/∂β.
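
A minimal R sketch contrasting GD and SGD on a toy problem (squared loss with ℓ2 regularization as a stand-in objective; step sizes and iteration counts are arbitrary illustrative choices):

    set.seed(0)
    n <- 1000; d <- 20; lambda <- 0.1
    X <- matrix(rnorm(n * d), n, d)
    y <- X %*% rnorm(d) + rnorm(n)

    # Gradient descent: every step touches the whole data set, O(nd) per iteration
    beta_gd <- rep(0, d)
    for (it in 1:200) {
      grad <- -2 * t(X) %*% (y - X %*% beta_gd) / n + 2 * lambda * beta_gd
      beta_gd <- beta_gd - 0.05 * grad
    }

    # Stochastic gradient descent: one random example per step, O(d) per iteration
    beta_sgd <- rep(0, d)
    for (it in 1:5000) {
      i <- sample(n, 1)
      r <- as.numeric(y[i] - sum(X[i, ] * beta_sgd))
      grad_i <- -2 * r * X[i, ] + 2 * lambda * beta_sgd
      beta_sgd <- beta_sgd - (0.1 / it) * grad_i
    }

The per-iteration costs, O(nd) versus O(d), are exactly what the next slide trades against the number of iterations needed.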

  56. Results [Bottou and Bousquet, 2008].
      Algorithm   Cost of one iteration   Iterations to reach ρ   Time to reach accuracy ρ     Time to reach E ≤ c(E_app + ε)
      GD          O(nd)                   O(κ log(1/ρ))           O(ndκ log(1/ρ))              O(d²κ / ε^{1/α} · log²(1/ε))
      2GD         O(d² + nd)              O(log log(1/ρ))         O((d² + nd) log log(1/ρ))    O(d² / ε^{1/α} · log(1/ε) log log(1/ε))
      SGD         O(d)                    νκ²/ρ + o(1/ρ)          O(dνκ²/ρ)                    O(dνκ²/ε)
      2SGD        O(d²)                   ν/ρ + o(1/ρ)            O(d²ν/ρ)                     O(d²ν/ε)
      Here α ∈ [1/2, 1] comes from the bound on ε_est and depends on the data. In the last column, n and ρ are optimized to reach ε for each method. 2GD optimizes much faster than GD, but the gain on the final performance is limited, dominated by the ε^{−1/α} factor coming from the estimation error. SGD: the optimization speed is catastrophic, but the learning speed is the best, and independent of α. This suggests that SGD is very competitive (and it has become the de facto standard in large-scale ML).

  57. Illustration: https://bigdata2013.sciencesconf.org/conference/bigdata2013/pages/bottou.pdf

  58. Outline (same as slide 45).

  59. Motivation. The dimension d affects the scalability of algorithms, e.g., O(nd) for kNN or O(d³) for ridge regression. High-dimensional data are hard to visualize. (Sometimes) counterintuitive phenomena occur in high dimension, e.g., concentration of measure for Gaussian data. [Histograms of ‖x‖/√d for d = 1, 10, 100: the distribution concentrates as d grows.] Statistical inference degrades when d increases (curse of dimensionality).
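
The concentration effect in the histograms is easy to reproduce; a small R sketch on simulated standard Gaussian data (which is what the slide's histograms are assumed to show):

    # ||x|| / sqrt(d) for a standard Gaussian vector concentrates around 1 as d grows.
    set.seed(1)
    for (d in c(1, 10, 100)) {
      r <- replicate(1000, sqrt(sum(rnorm(d)^2) / d))
      cat(sprintf("d = %3d: mean = %.2f, sd = %.2f\n", d, mean(r), sd(r)))
    }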

  60. Dimension reduction with PCA. [Figure: a data cloud with its first two principal axes PC1 and PC2.] Projects the data onto k < d dimensions that capture the largest amount of variance. Also minimizes the total reconstruction error: min_{S_k} Σ_{i=1}^n ‖x_i − Π_{S_k}(x_i)‖². But computationally expensive: O(nd²). No theoretical guarantee on distance preservation.

  61. Linear dimension reduction. X′ = X × R, with X′ of size n × k, X of size n × d, and R of size d × k. Can we find R efficiently? Can we preserve distances, i.e., ‖f(x_i) − f(x_j)‖ ≈ ‖x_i − x_j‖ for all i, j = 1, ..., n? Note: when d > n, we can take k = n and preserve all distances exactly (kernel trick).

  62. Random projections. Simply take a random projection matrix: f(x) = (1/√k) R^⊤x with R_ij ∼ N(0, 1). Theorem [Johnson and Lindenstrauss, 1984]: for any ε > 0 and n ∈ N, take k ≥ 4 (ε²/2 − ε³/3)^{−1} log(n) ≈ ε^{−2} log(n). Then the following holds with probability at least 1 − 1/n: (1 − ε)‖x_i − x_j‖² ≤ ‖f(x_i) − f(x_j)‖² ≤ (1 + ε)‖x_i − x_j‖² for all i, j = 1, ..., n. Note that k does not depend on d! Examples: n = 1M, ε = 0.1 ⇒ k ≈ 5K; n = 1B, ε = 0.1 ⇒ k ≈ 8K.
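
A short R sketch of a Gaussian random projection, checking the distortion of one pairwise distance on simulated data (dimensions are chosen small enough to run quickly; the value of k follows the bound in the theorem):

    set.seed(42)
    n <- 200; d <- 5000; eps <- 0.3
    k <- ceiling(4 * log(n) / (eps^2 / 2 - eps^3 / 3))   # ~ 590 here, independent of d
    X <- matrix(rnorm(n * d), n, d)

    R <- matrix(rnorm(d * k), d, k)
    Xp <- X %*% R / sqrt(k)                  # f(x) = R'x / sqrt(k), applied to every row

    i <- 1; j <- 2                           # distortion of one pair (repeat over pairs)
    sum((Xp[i, ] - Xp[j, ])^2) / sum((X[i, ] - X[j, ])^2)   # close to 1, within 1 +/- eps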

  63. Proof (1/3). For a single dimension, q_j = r_j^⊤u: E(q_j) = E(r_j)^⊤u = 0 and E(q_j²) = u^⊤E(r_j r_j^⊤)u = ‖u‖². For the k-dimensional projection f(u) = (1/√k) R^⊤u: ‖f(u)‖² = (1/k) Σ_{j=1}^k q_j² ∼ (‖u‖²/k) χ²(k), and E‖f(u)‖² = (1/k) Σ_{j=1}^k E(q_j²) = ‖u‖². It remains to show that ‖f(u)‖² is concentrated around its mean.

  64. Proof (2/3).
      P(‖f(u)‖² > (1 + ε)‖u‖²) = P(χ²(k) > (1 + ε)k)
        = P(e^{λχ²(k)} > e^{λ(1+ε)k})
        ≤ E(e^{λχ²(k)}) e^{−λ(1+ε)k}              (Markov)
        = (1 − 2λ)^{−k/2} e^{−λ(1+ε)k}            (MGF of χ²(k), for 0 ≤ λ < 1/2)
        = ((1 + ε) e^{−ε})^{k/2}                   (take λ = ε/(2(1+ε)))
        ≤ e^{−(ε²/2 − ε³/3) k/2}                   (use log(1+x) ≤ x − x²/2 + x³/3)
        = n^{−2}                                   (take k = 4 (ε²/2 − ε³/3)^{−1} log(n))
      Similarly, we get P(‖f(u)‖² < (1 − ε)‖u‖²) < n^{−2}.

  65. Proof (3/3). Apply the bound with u = x_i − x_j and use the linearity of f to show that, for one (x_i, x_j) pair, the probability of a large distortion is ≤ 2n^{−2}. Union bound: over all n(n−1)/2 pairs, the probability that at least one pair has a large distortion is smaller than (n(n−1)/2) × 2n^{−2} = 1 − 1/n.

  66. Scalability. n = O(1B), d = O(1M) ⇒ k = O(10K). Memory: need to store R, O(dk) ≈ 40 GB. Computation: X × R in O(ndk). Other random matrices R have similar properties but better scalability, e.g.: "add or subtract" [Achlioptas, 2003], 1 bit per entry, size ≈ 1.25 GB, with R_ij = +1 with probability 1/2 and −1 with probability 1/2; fast Johnson-Lindenstrauss transform [Ailon and Chazelle, 2009], where R = PHD, computing f(x) in O(d log d).
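
The "add or subtract" variant is a one-line change to the sketch above: replace the Gaussian entries by random signs, which need a single bit each (reusing X, d and k from that sketch):

    # Achlioptas-style projection matrix: entries are +1 or -1 with probability 1/2.
    R_sign <- matrix(sample(c(-1, 1), d * k, replace = TRUE), d, k)
    Xp_sign <- X %*% R_sign / sqrt(k)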

  67. Outline Introduction 1 Standard machine learning 2 Large-scale machine learning 3 Scalability issues The tradeoffs of large-scale learning Random projections Random features Approximate NN Shingling, hashing, sketching Conclusion 4 73 / 104

  68. Motivation. [Diagram: the kernel feature map Φ maps R^d to a high-dimensional feature space R^D, while a JL random projection maps R^d to a low-dimensional R^k.] Random features?

  69. Fourier feature space. Example: Gaussian kernel
      e^{−‖x−x′‖²/2} = (1/(2π)^{d/2}) ∫_{R^d} e^{iω^⊤(x−x′)} e^{−‖ω‖²/2} dω
                     = E_ω[cos(ω^⊤(x − x′))]
                     = E_{ω,b}[2 cos(ω^⊤x + b) cos(ω^⊤x′ + b)]
      with ω ∼ p(dω) = (2π)^{−d/2} e^{−‖ω‖²/2} dω and b ∼ U([0, 2π]). This is of the form K(x, x′) = Φ(x)^⊤Φ(x′) with D = +∞: Φ: R^d → L²((R^d, p(dω)) × ([0, 2π], U)).

  70. Random Fourier features [Rahimi and Recht, 2008]. For i = 1, ..., k, sample randomly (ω_i, b_i) ∼ p(dω) × U([0, 2π]) and create the random features f_i(x) = √(2/k) cos(ω_i^⊤x + b_i) for all x ∈ R^d. [Diagram: the feature vector is obtained by multiplying x by the matrix of random ω_i's, adding the offsets b_i, and taking cosines.]

  71. Random Fourier features [Rahimi and Recht, 2008] (continued). For any x, x′ ∈ R^d, E[f(x)^⊤f(x′)] = Σ_{i=1}^k E[f_i(x) f_i(x′)] = (1/k) Σ_{i=1}^k E[2 cos(ω^⊤x + b) cos(ω^⊤x′ + b)] = K(x, x′), and by Hoeffding's inequality, P(|f(x)^⊤f(x′) − K(x, x′)| > ε) ≤ 2 e^{−kε²/2}. This allows us to approximate learning with the Gaussian kernel by a simple linear model in k dimensions!
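
A minimal R sketch of the construction for the Gaussian kernel with bandwidth σ, so that ω ∼ N(0, σ^{−2}I) (all data here are simulated and illustrative):

    set.seed(123)
    d <- 10; k <- 2000; sigma <- 1
    omega <- matrix(rnorm(d * k, sd = 1 / sigma), d, k)   # omega_i ~ N(0, sigma^{-2} I)
    b <- runif(k, 0, 2 * pi)

    # f(x): the k random Fourier features sqrt(2/k) cos(omega_i'x + b_i)
    rff <- function(x) sqrt(2 / k) * cos(drop(x %*% omega) + b)

    x  <- rnorm(d)
    xp <- x + 0.3 * rnorm(d)                 # a nearby point, so K(x, x') is not tiny
    c(approx = sum(rff(x) * rff(xp)),        # f(x)'f(x'), error of order 1/sqrt(k)
      exact  = exp(-sum((x - xp)^2) / (2 * sigma^2)))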

  72. Generalization. A translation-invariant (t.i.) kernel is of the form K(x, x′) = φ(x − x′). Bochner's theorem: for a continuous function φ: R^d → R, K is p.d. if and only if φ is the Fourier-Stieltjes transform of a symmetric and positive finite Borel measure µ ∈ M(R^d): φ(x) = ∫_{R^d} e^{−iω^⊤x} dµ(ω). Just sample ω_i ∼ dµ(ω)/µ(R^d) and b_i ∼ U([0, 2π]) to approximate any t.i. kernel K with the random features √(2/k) cos(ω_i^⊤x + b_i).

  73. Examples: K(x, x′) = φ(x − x′) = ∫_{R^d} e^{−iω^⊤(x−x′)} dµ(ω).
      Kernel     φ(x)                       µ(dω)
      Gaussian   exp(−‖x‖²/2)               (2π)^{−d/2} exp(−‖ω‖²/2)
      Laplace    exp(−‖x‖_1)                Π_{i=1}^d 1/(π(1 + ω_i²))
      Cauchy     Π_{i=1}^d 2/(1 + x_i²)     e^{−‖ω‖_1}

  74. Performance [Rahimi and Recht, 2008]. [Empirical results from the paper.]
