k-means example: Iris dataset

[Figure: the Iris data projected on the first two principal components (PC1 vs PC2); successive slides show the points colored by the k-means clusters for k = 2, 3, 4 and 5.]

> irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
> table(irisCluster$cluster, iris$Species)

    setosa versicolor virginica
  1      0         48         4
  2     50          0         0
  3      0          2        46
k-means complexity

Each update step: O(nd)
Each assignment step: O(ndk)
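As an illustration of these costs, here is a minimal R sketch of Lloyd's algorithm; the function name lloyd and its defaults are our own choices for this example, not the code behind the slides, but the assignment and update steps and their costs appear explicitly.

lloyd <- function(X, k, iters = 20) {
  # X: n x d numeric matrix, k: number of clusters
  n <- nrow(X)
  centers <- X[sample(n, k), , drop = FALSE]           # random initialization
  for (it in 1:iters) {
    # Assignment step, O(ndk): distance of every point to every center
    d2 <- sapply(1:k, function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    cl <- max.col(-d2)                                 # index of the closest center
    # Update step, O(nd): mean of the points assigned to each cluster
    for (j in 1:k) {
      if (any(cl == j))
        centers[j, ] <- colMeans(X[cl == j, , drop = FALSE])
    }
  }
  list(cluster = cl, centers = centers)
}

res <- lloyd(log(as.matrix(iris[, 1:4])), k = 3)       # same data as the example above
table(res$cluster, iris$Species)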
Outline
1 Introduction
2 Standard machine learning
    Dimension reduction: PCA
    Clustering: k-means
    Regression: ridge regression
    Classification: kNN, logistic regression and SVM
    Nonlinear models: kernel methods
3 Large-scale machine learning
4 Conclusion
Motivation

[Figure: scatter plot of y against x for a noisy, roughly linear relationship.]

Predict a continuous output from an input
Model

Training set S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}

Fit a linear function: f_\beta(x) = \beta^\top x

Goodness of fit measured by the residual sum of squares:
RSS(\beta) = \sum_{i=1}^n (y_i - f_\beta(x_i))^2

Ridge regression minimizes the regularized RSS:
\min_\beta RSS(\beta) + \lambda \sum_{i=1}^d \beta_i^2

Solution (set gradient to 0):
\hat\beta = \left(X^\top X + \lambda I\right)^{-1} X^\top Y
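A minimal R sketch of the closed-form solution above (the helper name ridge_fit and the simulated data are ours, for illustration only):

ridge_fit <- function(X, y, lambda) {
  # beta_hat = (X'X + lambda I)^{-1} X'y; solve() avoids forming the inverse explicitly
  d <- ncol(X)
  solve(crossprod(X) + lambda * diag(d), crossprod(X, y))
}

set.seed(1)
n <- 100; d <- 5
X <- matrix(rnorm(n * d), n, d)
y <- X %*% rnorm(d) + rnorm(n)
beta_hat <- ridge_fit(X, y, lambda = 0.1)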
Ridge regression complexity

Compute X^\top X: O(nd^2)
Invert \left(X^\top X + \lambda I\right): O(d^3)

Computing X^\top X is more expensive than inverting it!
Outline
1 Introduction
2 Standard machine learning
    Dimension reduction: PCA
    Clustering: k-means
    Regression: ridge regression
    Classification: kNN, logistic regression and SVM
    Nonlinear models: kernel methods
3 Large-scale machine learning
4 Conclusion
Motivation

Predict the category of a data point
2 or more (sometimes many) categories
k-nearest neighbors (kNN)

[Figure: two-class training points and a kNN decision boundary. (Hastie et al. The Elements of Statistical Learning. Springer, 2001.)]

Training set S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times \{-1, 1\}
No training
Given a new point x \in \mathbb{R}^d, predict the majority class among its k nearest neighbors (take k odd)
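A minimal R sketch of this prediction rule (the helper knn_predict is ours, for illustration only; Xtrain is an n x d matrix and the labels are in {-1, +1}):

knn_predict <- function(Xtrain, ytrain, xnew, k = 5) {
  d2 <- rowSums(sweep(Xtrain, 2, xnew)^2)   # O(nd): squared distances to all training points
  nn <- order(d2)[1:k]                      # indices of the k nearest neighbors
  sign(sum(ytrain[nn]))                     # majority vote (k odd, so no ties)
}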
kNN properties

Uniform Bayes consistency [Stone, 1977]
    Take k = \sqrt{n} (for example)
    Let P be any distribution over (X, Y) pairs
    Assume training data are random pairs sampled i.i.d. according to P
    Then the kNN classifier \hat{f}_n satisfies, almost surely:
    \lim_{n \to +\infty} P(\hat{f}_n(X) \neq Y) = \inf_{f \text{ measurable}} P(f(X) \neq Y)

Complexity:
    Memory: store X, O(nd)
    Training time: 0
    Prediction: O(nd) for each test point
Linear models for classification

Training set S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times \{-1, 1\}
Fit a linear function f_\beta(x) = \beta^\top x
The prediction on a new point x \in \mathbb{R}^d is:
    +1 if f_\beta(x) > 0, -1 otherwise.
Large-margin classifiers

For any f: \mathbb{R}^d \to \mathbb{R}, the margin of f on an (x, y) pair is y f(x)
Large-margin classifiers fit a classifier by maximizing the margins on the training set:
\min_\beta \sum_{i=1}^n \ell(y_i f_\beta(x_i)) + \lambda \beta^\top \beta
for a convex, non-increasing loss function \ell: \mathbb{R} \to \mathbb{R}_+
Loss function examples

Loss       \ell(u)               Method
0-1        1(u \leq 0)           none
Hinge      \max(1 - u, 0)        Support vector machine (SVM)
Logistic   \log(1 + e^{-u})      Logistic regression
Square     (1 - u)^2             Ridge regression
Ridge logistic regression [Le Cessie and van Houwelingen, 1992]

\min_{\beta \in \mathbb{R}^p} J(\beta) = \sum_{i=1}^n \ln\left(1 + e^{-y_i \beta^\top x_i}\right) + \lambda \beta^\top \beta

Can be interpreted as a regularized conditional maximum likelihood estimator
No explicit solution, but a smooth convex optimization problem that can be solved numerically by Newton-Raphson iterations:
\beta^{new} \leftarrow \beta^{old} - \left(\nabla^2_\beta J\left(\beta^{old}\right)\right)^{-1} \nabla_\beta J\left(\beta^{old}\right)
Each iteration amounts to solving a weighted ridge regression problem, hence the name iteratively reweighted least squares (IRLS).
Complexity: O(\text{iterations} \times (nd^2 + d^3))
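A possible R sketch of the Newton-Raphson / IRLS iterations for this objective (our own illustrative implementation, assuming labels y_i in {-1, +1}; not reference code):

sigmoid <- function(u) 1 / (1 + exp(-u))

ridge_logistic <- function(X, y, lambda, iters = 25) {
  d <- ncol(X)
  beta <- rep(0, d)
  for (it in 1:iters) {
    m <- drop(X %*% beta) * y                               # margins y_i beta' x_i
    p <- sigmoid(m)
    grad <- -crossprod(X, y * (1 - p)) + 2 * lambda * beta  # gradient, O(nd)
    W <- p * (1 - p)                                        # per-sample weights
    H <- crossprod(X, X * W) + 2 * lambda * diag(d)         # Hessian, O(nd^2)
    beta <- drop(beta - solve(H, grad))                     # Newton step, O(d^3)
  }
  beta
}

The Hessian H is a weighted ridge matrix, which is why each Newton step amounts to a weighted ridge regression, as noted above.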
SVM [Boser et al., 1992]

\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \max\left(0, 1 - y_i \beta^\top x_i\right) + \lambda \beta^\top \beta

A non-smooth convex optimization problem (convex quadratic program)
Equivalent to the dual problem
\max_{\alpha \in \mathbb{R}^n} 2\alpha^\top Y - \alpha^\top X X^\top \alpha \quad \text{s.t.} \quad 0 \leq y_i \alpha_i \leq \frac{1}{2\lambda} \text{ for } i = 1, \ldots, n

The solution \beta^* of the primal is obtained from the solution \alpha^* of the dual:
\beta^* = X^\top \alpha^*, \qquad f_{\beta^*}(x) = (\beta^*)^\top x = (\alpha^*)^\top X x

Training complexity: O(n^2) to store X X^\top, O(n^3) to find \alpha^*
Prediction: O(d) for (\beta^*)^\top x, O(nd) for (\alpha^*)^\top X x
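As a rough numerical illustration of the dual, here is a projected-gradient sketch in R; the objective, the box constraint and the primal recovery follow the formulas above, but plain projected gradient is our own illustrative choice, not a production SVM solver:

svm_dual_pg <- function(X, y, lambda, iters = 2000) {
  n <- nrow(X)
  K <- tcrossprod(X)                                   # K = X X', O(n^2 d) time, O(n^2) memory
  C <- 1 / (2 * lambda)
  eta <- 1 / (2 * max(eigen(K, only.values = TRUE)$values) + 1e-12)   # safe step size
  alpha <- rep(0, n)
  for (it in 1:iters) {
    g <- 2 * y - 2 * drop(K %*% alpha)                 # gradient of 2 alpha'Y - alpha'K alpha
    alpha <- alpha + eta * g                           # ascent step
    alpha <- y * pmin(pmax(y * alpha, 0), C)           # project onto 0 <= y_i alpha_i <= 1/(2 lambda)
  }
  alpha
}

# Primal recovery and prediction: beta_star <- crossprod(X, alpha); f(x) = sum(beta_star * x)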
Outline
1 Introduction
2 Standard machine learning
    Dimension reduction: PCA
    Clustering: k-means
    Regression: ridge regression
    Classification: kNN, logistic regression and SVM
    Nonlinear models: kernel methods
3 Large-scale machine learning
4 Conclusion
Motivation

[Figure: scatter plot of y against x showing a clearly nonlinear relationship.]
Model

Learn a function f: \mathbb{R}^d \to \mathbb{R} of the form
f(x) = \sum_{i=1}^n \alpha_i K(x_i, x)
for a positive definite (p.d.) kernel K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}, such as

Linear:     K(x, x') = x^\top x'
Polynomial: K(x, x') = \left(x^\top x' + c\right)^p
Gaussian:   K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)
Min/max:    K(x, x') = \sum_{i=1}^d \frac{\min(|x_i|, |x'_i|)}{\max(|x_i|, |x'_i|)}
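Written as small R functions on single vectors, these kernels read as follows (a sketch; the min/max kernel assumes non-zero coordinates):

k_linear   <- function(x, xp) sum(x * xp)
k_poly     <- function(x, xp, c = 1, p = 2) (sum(x * xp) + c)^p
k_gaussian <- function(x, xp, sigma = 1) exp(-sum((x - xp)^2) / (2 * sigma^2))
k_minmax   <- function(x, xp) sum(pmin(abs(x), abs(xp)) / pmax(abs(x), abs(xp)))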
Feature space

A function K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R} is a p.d. kernel if and only if there exists a mapping \Phi: \mathbb{R}^d \to \mathbb{R}^D, for some D \in \mathbb{N} \cup \{+\infty\}, such that
\forall x, x' \in \mathbb{R}^d, \quad K(x, x') = \Phi(x)^\top \Phi(x')

f is then a linear function in \mathbb{R}^D:
f(x) = \sum_{i=1}^n \alpha_i K(x_i, x) = \sum_{i=1}^n \alpha_i \Phi(x_i)^\top \Phi(x) = \beta^\top \Phi(x)
for \beta = \sum_{i=1}^n \alpha_i \Phi(x_i).

[Figure: a data set that is not linearly separable in (x_1, x_2) becomes linearly separable after the feature mapping.]
Learning

[Figure: the same feature-map illustration as on the previous slide.]

We can learn f(x) = \sum_{i=1}^n \alpha_i K(x_i, x) by fitting a linear model \beta^\top \Phi(x) in the feature space
Example: ridge regression / logistic regression / SVM
\min_{\beta \in \mathbb{R}^D} \sum_{i=1}^n \ell\left(y_i, \beta^\top \Phi(x_i)\right) + \lambda \beta^\top \beta
But D can be very large, even infinite...
Kernel tricks

K(x, x') = \Phi(x)^\top \Phi(x') can be quick to compute even if D is large (even infinite)
For a set of training samples \{x_1, \ldots, x_n\} \subset \mathbb{R}^d, let K be the n \times n Gram matrix: K_{ij} = K(x_i, x_j)
For \beta = \sum_{i=1}^n \alpha_i \Phi(x_i) we have
\beta^\top \Phi(x_i) = [K\alpha]_i \quad \text{and} \quad \beta^\top \beta = \alpha^\top K \alpha
We can therefore solve the equivalent problem in \alpha \in \mathbb{R}^n:
\min_{\alpha \in \mathbb{R}^n} \sum_{i=1}^n \ell(y_i, [K\alpha]_i) + \lambda \alpha^\top K \alpha
Example: kernel ridge regression (KRR)

\min_{\beta \in \mathbb{R}^D} \sum_{i=1}^n \left(y_i - \beta^\top \Phi(x_i)\right)^2 + \lambda \beta^\top \beta

Solve in \mathbb{R}^D:
\hat\beta = \underbrace{\left(\Phi(X)^\top \Phi(X) + \lambda I\right)^{-1}}_{D \times D} \Phi(X)^\top Y

Solve in \mathbb{R}^n:
\hat\alpha = \underbrace{(K + \lambda I)^{-1}}_{n \times n} Y
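A minimal R sketch of KRR solved in alpha with the Gaussian RBF kernel (the helper names krr_fit and krr_predict are ours, for illustration only):

krr_fit <- function(X, y, lambda, sigma = 1) {
  n <- nrow(X)
  D2 <- as.matrix(dist(X))^2                   # pairwise squared distances, O(n^2 d)
  K <- exp(-D2 / (2 * sigma^2))                # Gram matrix
  alpha <- solve(K + lambda * diag(n), y)      # alpha_hat = (K + lambda I)^{-1} Y, O(n^3)
  list(alpha = alpha, X = X, sigma = sigma)
}

krr_predict <- function(model, xnew) {
  d2 <- rowSums(sweep(model$X, 2, xnew)^2)
  sum(model$alpha * exp(-d2 / (2 * model$sigma^2)))   # f(x) = sum_i alpha_i K(x_i, x)
}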
KRR with Gaussian RBF kernel

\min_{\beta \in \mathbb{R}^D} \sum_{i=1}^n \left(y_i - \beta^\top \Phi(x_i)\right)^2 + \lambda \beta^\top \beta, \qquad K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)

[Figures: the fitted KRR curve on the nonlinear toy data for lambda = 1000, 100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001 and 0.0000001, going from a nearly flat (underfitted) fit to a very wiggly (overfitted) one as lambda decreases.]
Complexity

[Figure: the fitted KRR curve for lambda = 1 on the toy data.]

Compute K: O(dn^2)
Store K: O(n^2)
Solve for \alpha: O(n^{2\sim3})
Compute f(x) for one x: O(nd)

Impractical for n > 10\sim100k
Outline
1 Introduction
2 Standard machine learning
    Dimension reduction: PCA
    Clustering: k-means
    Regression: ridge regression
    Classification: kNN, logistic regression and SVM
    Nonlinear models: kernel methods
3 Large-scale machine learning
    Scalability issues
    The tradeoffs of large-scale learning
    Random projections
    Random features
    Approximate NN
    Shingling, hashing, sketching
4 Conclusion
Outline
1 Introduction
2 Standard machine learning
3 Large-scale machine learning
    Scalability issues
    The tradeoffs of large-scale learning
    Random projections
    Random features
    Approximate NN
    Shingling, hashing, sketching
4 Conclusion
What is "large-scale"?

Data cannot fit in RAM
Algorithm cannot run on a single machine in reasonable time (algorithm-dependent)
Sometimes even O(n) is too large! (e.g., nearest neighbor in a database of O(B+) items)
Many tasks / parameters (e.g., image categorization in O(10M) classes)
Streams of data
Things to worry about

Training time (usually offline)
Memory requirements
Test time

Complexities so far

Method                 Memory     Training time    Test time
PCA                    O(d^2)     O(nd^2)          O(d)
k-means                O(nd)      O(ndk)           O(kd)
Ridge regression       O(d^2)     O(nd^2)          O(d)
kNN                    O(nd)      0                O(nd)
Logistic regression    O(nd)      O(nd^2)          O(d)
SVM, kernel methods    O(n^2)     O(n^3)           O(nd)
Techniques for large-scale machine learning

Good baselines:
    Subsample data and run standard method
    Split and run on several machines (depends on algorithm)
Need to revisit standard algorithms and implementation, taking into account scalability:
    Trade exactness for scalability
    Compress, sketch, hash data in a smart way
Outline
1 Introduction
2 Standard machine learning
3 Large-scale machine learning
    Scalability issues
    The tradeoffs of large-scale learning
    Random projections
    Random features
    Approximate NN
    Shingling, hashing, sketching
4 Conclusion
Motivation

Classical learning theory analyzes the trade-off between:
    approximation error (how well we approximate the true function)
    estimation error (how well we estimate the parameters)

[Figure: illustration of the function class F.]

But reaching the best trade-off for a given n may be impossible with limited computational resources
We should include the computational budget in the trade-off, and see which optimization algorithm gives the best trade-off!
Seminal paper of Bottou and Bousquet [2008]
Classical ERM setting

Goal: learn a function f: \mathbb{R}^d \to \mathcal{Y} (\mathcal{Y} = \mathbb{R} or \{-1, 1\})
P unknown distribution over \mathbb{R}^d \times \mathcal{Y}
Training set: S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\} \subset \mathbb{R}^d \times \mathcal{Y}, i.i.d. following P
Fix a class of functions \mathcal{F} \subset \{f: \mathbb{R}^d \to \mathbb{R}\}
Choose a loss \ell(y, f(x))
Learning by empirical risk minimization:
f_n \in \arg\min_{f \in \mathcal{F}} R_n[f] = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i))
Hope that f_n has a small risk: R[f_n] = \mathbb{E}\,\ell(Y, f_n(X))
Classical ERM setting

The best possible risk is R^* = \min_{f: \mathbb{R}^d \to \mathcal{Y}} R[f]
The best achievable risk over \mathcal{F} is R^*_{\mathcal{F}} = \min_{f \in \mathcal{F}} R[f]
We then have the decomposition
R[f_n] - R^* = \underbrace{R[f_n] - R^*_{\mathcal{F}}}_{\text{estimation error } \epsilon_{est}} + \underbrace{R^*_{\mathcal{F}} - R^*}_{\text{approximation error } \epsilon_{app}}

[Figure: illustration of the function class F.]
Optimization error

Solving the ERM problem may be hard (when n and d are large)
Instead we usually find an approximate solution \tilde{f}_n that satisfies
R_n[\tilde{f}_n] \leq R_n[f_n] + \rho
The excess risk of \tilde{f}_n is then
\epsilon = R[\tilde{f}_n] - R^* = \underbrace{R[\tilde{f}_n] - R[f_n]}_{\text{optimization error } \epsilon_{opt}} + \epsilon_{est} + \epsilon_{app}
A new trade-off

\epsilon = \epsilon_{app} + \epsilon_{est} + \epsilon_{opt}

Problem: choose \mathcal{F}, n, \rho to make \epsilon as small as possible, subject to a limit on n and on the computation time T

Table 1: Typical variations when \mathcal{F}, n, and \rho increase.

                                  F      n      rho
E_app (approximation error)       ↘
E_est (estimation error)          ↗      ↘
E_opt (optimization error)        ···    ···    ↗
T (computation time)              ↗      ↗      ↘

Large-scale or small-scale?
    Small-scale when the constraint on n is active
    Large-scale when the constraint on T is active
Comparing optimization methods

\min_{\beta \in B \subset \mathbb{R}^d} R_n[f_\beta] = \sum_{i=1}^n \ell(y_i, f_\beta(x_i))

Gradient descent (GD):
\beta_{t+1} \leftarrow \beta_t - \eta \frac{\partial R_n(f_{\beta_t})}{\partial \beta}

Second-order gradient descent (2GD), assuming the Hessian H is known:
\beta_{t+1} \leftarrow \beta_t - H^{-1} \frac{\partial R_n(f_{\beta_t})}{\partial \beta}

Stochastic gradient descent (SGD):
\beta_{t+1} \leftarrow \beta_t - \frac{\eta}{t} \frac{\partial \ell(y_t, f_{\beta_t}(x_t))}{\partial \beta}
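A possible R sketch of SGD with step size eta/t on the ridge logistic objective (our own illustrative implementation; splitting the regularizer evenly over the samples is one choice among several):

sgd_logistic <- function(X, y, lambda, eta = 1, epochs = 10) {
  n <- nrow(X); d <- ncol(X)
  beta <- rep(0, d)
  t <- 0
  for (e in 1:epochs) {
    for (i in sample(n)) {                                        # one random pass per epoch
      t <- t + 1
      m <- y[i] * sum(X[i, ] * beta)                              # margin y_i beta' x_i
      g <- -y[i] * X[i, ] / (1 + exp(m)) + 2 * lambda * beta / n  # stochastic gradient
      beta <- beta - (eta / t) * g                                # O(d) per update
    }
  }
  beta
}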
Results [Bottou and Bousquet, 2008]

Algorithm   Cost of one iteration   Iterations to reach \rho                        Time to reach accuracy \rho                      Time to reach E \leq c(E_{app} + \epsilon)
GD          O(nd)                   O(\kappa \log\frac{1}{\rho})                    O(nd\kappa \log\frac{1}{\rho})                   O(\frac{d^2 \kappa}{\epsilon^{1/\alpha}} \log^2\frac{1}{\epsilon})
2GD         O(d^2 + nd)             O(\log\log\frac{1}{\rho})                       O((d^2 + nd) \log\log\frac{1}{\rho})             O(\frac{d^2}{\epsilon^{1/\alpha}} \log\frac{1}{\epsilon} \log\log\frac{1}{\epsilon})
SGD         O(d)                    \frac{\nu\kappa^2}{\rho} + o(\frac{1}{\rho})    O(\frac{d\nu\kappa^2}{\rho})                     O(\frac{d\nu\kappa^2}{\epsilon})
2SGD        O(d^2)                  \frac{\nu}{\rho} + o(\frac{1}{\rho})            O(\frac{d^2\nu}{\rho})                           O(\frac{d^2\nu}{\epsilon})

\alpha \in [1/2, 1] comes from the bound on \epsilon_{est} and depends on the data
In the last column, n and \rho are optimized to reach \epsilon for each method
2GD optimizes much faster than GD, but the gain on the final performance is limited by the \epsilon^{-1/\alpha} factor coming from the estimation error
SGD: optimization speed is catastrophic, but learning speed is the best, and independent of \alpha
This suggests that SGD is very competitive (and has become the de facto standard in large-scale ML)
Illustration

https://bigdata2013.sciencesconf.org/conference/bigdata2013/pages/bottou.pdf
Outline
1 Introduction
2 Standard machine learning
3 Large-scale machine learning
    Scalability issues
    The tradeoffs of large-scale learning
    Random projections
    Random features
    Approximate NN
    Shingling, hashing, sketching
4 Conclusion
Motivation

Affects scalability of algorithms, e.g., O(nd) for kNN or O(d^3) for ridge regression
Hard to visualize
(Sometimes) counterintuitive phenomena in high dimension, e.g., concentration of measure for Gaussian data

[Figure: histograms of \|x\|/\sqrt{d} for standard Gaussian data in dimensions d = 1, 10, 100: the norm concentrates as the dimension grows.]

Statistical inference degrades when d increases (curse of dimension)
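A quick R sketch reproducing the concentration-of-norm figure (illustrative choice of 1000 standard Gaussian samples per dimension):

set.seed(1)
par(mfrow = c(1, 3))
for (d in c(1, 10, 100)) {
  x <- matrix(rnorm(1000 * d), 1000, d)
  hist(sqrt(rowSums(x^2)) / sqrt(d), breaks = 30,
       main = paste("d =", d), xlab = "||x|| / sqrt(d)", xlim = c(0, 3))
}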
Dimension reduction with PCA

[Figure: 2-D point cloud with its first two principal directions PC1 and PC2.]

Projects data onto k < d dimensions that capture the largest amount of variance
Also minimizes the total reconstruction error:
\min_{S_k} \sum_{i=1}^n \|x_i - \Pi_{S_k}(x_i)\|^2
But computationally expensive: O(nd^2)
No theoretical guarantee on distance preservation
Linear dimension reduction

X' = X \times R
(n \times k) = (n \times d)(d \times k)

Can we find R efficiently?
Can we preserve distances?
\forall i, j = 1, \ldots, n, \quad \|f(x_i) - f(x_j)\| \approx \|x_i - x_j\|
Note: when d > n, we can take k = n and preserve all distances exactly (kernel trick)
Random projections

Simply take a random projection matrix:
f(x) = \frac{1}{\sqrt{k}} R^\top x \quad \text{with} \quad R_{ij} \sim \mathcal{N}(0, 1)

Theorem [Johnson and Lindenstrauss, 1984]
For any \epsilon > 0 and n \in \mathbb{N}, take
k \geq 4\left(\epsilon^2/2 - \epsilon^3/3\right)^{-1} \log(n) \approx \epsilon^{-2} \log(n).
Then the following holds with probability at least 1 - 1/n:
(1 - \epsilon)\|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1 + \epsilon)\|x_i - x_j\|^2 \quad \forall i, j = 1, \ldots, n

k does not depend on d!
n = 1M, \epsilon = 0.1 \Rightarrow k \approx 5K
n = 1B, \epsilon = 0.1 \Rightarrow k \approx 8K
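An empirical R check of the theorem (the sizes n, d and epsilon are illustrative; k follows the bound above):

set.seed(1)
n <- 100; d <- 5000; eps <- 0.3
k <- ceiling(4 * log(n) / (eps^2 / 2 - eps^3 / 3))
X <- matrix(rnorm(n * d), n, d)
R <- matrix(rnorm(d * k), d, k)
Xp <- X %*% R / sqrt(k)                                    # f(x) = R'x / sqrt(k), applied row-wise
ratio <- as.vector(dist(Xp))^2 / as.vector(dist(X))^2      # distortion of all pairwise squared distances
range(ratio)                                               # should lie within [1 - eps, 1 + eps] w.h.p.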
Proof (1/3)

For a single dimension, q_j = r_j^\top u:
\mathbb{E}(q_j) = \mathbb{E}(r_j)^\top u = 0
\mathbb{E}(q_j^2) = u^\top \mathbb{E}(r_j r_j^\top) u = \|u\|^2

For the k-dimensional projection f(u) = \frac{1}{\sqrt{k}} R^\top u:
\|f(u)\|^2 = \frac{1}{k} \sum_{j=1}^k q_j^2 \sim \frac{\|u\|^2}{k} \chi^2(k)
\mathbb{E}\|f(u)\|^2 = \frac{1}{k} \sum_{j=1}^k \mathbb{E}(q_j^2) = \|u\|^2

Need to show that \|f(u)\|^2 is concentrated around its mean
Proof (2/3)

P\left(\|f\|^2 > (1 + \epsilon)\|u\|^2\right)
= P\left(\chi^2(k) > (1 + \epsilon)k\right)
= P\left(e^{\lambda \chi^2(k)} > e^{\lambda(1+\epsilon)k}\right)
\leq \mathbb{E}\left(e^{\lambda \chi^2(k)}\right) e^{-\lambda(1+\epsilon)k}   (Markov)
= (1 - 2\lambda)^{-k/2} e^{-\lambda(1+\epsilon)k}   (MGF of \chi^2(k) for 0 \leq \lambda \leq 1/2)
= \left((1 + \epsilon)e^{-\epsilon}\right)^{k/2}   (take \lambda = \epsilon / (2(1+\epsilon)))
\leq e^{-(\epsilon^2/2 - \epsilon^3/3)k/2}   (use \log(1+x) \leq x - x^2/2 + x^3/3)
= n^{-2}   (take k = 4\log(n) / (\epsilon^2/2 - \epsilon^3/3))

Similarly we get P\left(\|f\|^2 < (1 - \epsilon)\|u\|^2\right) < n^{-2}
Proof (3/3)

Apply with u = x_i - x_j and use linearity of f to show that, for an (x_i, x_j) pair, the probability of a large distortion is \leq 2n^{-2}

Union bound: over all n(n-1)/2 pairs, the probability that at least one has a large distortion is smaller than
\frac{n(n-1)}{2} \times \frac{2}{n^2} = 1 - \frac{1}{n}
Scalability

n = O(1B); d = O(1M) \Rightarrow k = O(10K)
Memory: need to store R, O(dk) \approx 40 GB
Computation: X \times R in O(ndk)

Other random matrices R have similar properties but better scalability, e.g.:
"Add or subtract" [Achlioptas, 2003], 1 bit/entry, size \approx 1.25 GB:
R_{ij} = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases}
Fast Johnson-Lindenstrauss transform [Ailon and Chazelle, 2009], where R = PHD; compute f(x) in O(d \log d)
Outline
1 Introduction
2 Standard machine learning
3 Large-scale machine learning
    Scalability issues
    The tradeoffs of large-scale learning
    Random projections
    Random features
    Approximate NN
    Shingling, hashing, sketching
4 Conclusion
Motivation

[Diagram: the JL random projection maps R^d to R^k with a random matrix R, while a kernel feature map Phi maps R^d to a high-dimensional R^D.]

Random features?
Fourier feature space

Example: the Gaussian kernel

e^{-\frac{\|x - x'\|^2}{2}} = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{i\omega^\top(x - x')} e^{-\frac{\|\omega\|^2}{2}} d\omega
= \mathbb{E}_\omega\left[\cos\left(\omega^\top(x - x')\right)\right]
= \mathbb{E}_{\omega, b}\left[2\cos\left(\omega^\top x + b\right)\cos\left(\omega^\top x' + b\right)\right]
with
\omega \sim p(d\omega) = \frac{1}{(2\pi)^{d/2}} e^{-\frac{\|\omega\|^2}{2}} d\omega, \qquad b \sim U([0, 2\pi]).

This is of the form K(x, x') = \Phi(x)^\top \Phi(x') with D = +\infty:
\Phi: \mathbb{R}^d \to L^2\left(\left(\mathbb{R}^d, p(d\omega)\right) \times ([0, 2\pi], U)\right)
Random Fourier features [Rahimi and Recht, 2008]

[Diagram: the random feature matrix is built by multiplying the data by random frequencies \omega_j, adding random phases b_j, and taking cosines.]

For i = 1, \ldots, k, sample randomly:
(\omega_i, b_i) \sim p(d\omega) \times U([0, 2\pi])

Create random features:
\forall x \in \mathbb{R}^d, \quad f_i(x) = \sqrt{\frac{2}{k}} \cos\left(\omega_i^\top x + b_i\right)
Random Fourier features [Rahimi and Recht, 2008]

For any x, x' \in \mathbb{R}^d, it holds that
\mathbb{E}\left[f(x)^\top f(x')\right] = \sum_{i=1}^k \mathbb{E}\left[f_i(x) f_i(x')\right]
= \frac{1}{k} \sum_{i=1}^k \mathbb{E}\left[2\cos\left(\omega^\top x + b\right)\cos\left(\omega^\top x' + b\right)\right]
= K(x, x')

and by Hoeffding's inequality,
P\left(\left|f(x)^\top f(x') - K(x, x')\right| > \epsilon\right) \leq 2e^{-k\epsilon^2/2}

This allows us to approximate learning with the Gaussian kernel by a simple linear model in k dimensions!
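A minimal R sketch of random Fourier features for the Gaussian kernel with sigma = 1 (illustrative sizes; omega is drawn from N(0, I), matching the measure p(d omega) above):

set.seed(1)
d <- 10; k <- 2000
W <- matrix(rnorm(d * k), d, k)                 # columns are the omega_i
b <- runif(k, 0, 2 * pi)
rff <- function(X, W, b)                        # maps each row of X to its k random features
  sqrt(2 / ncol(W)) * cos(X %*% W + matrix(b, nrow(X), ncol(W), byrow = TRUE))

x <- rnorm(d); xp <- rnorm(d)
Z <- rff(rbind(x, xp), W, b)
sum(Z[1, ] * Z[2, ])                            # approximate kernel value f(x)'f(x')
exp(-sum((x - xp)^2) / 2)                       # exact Gaussian kernel value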
Generalization

A translation-invariant (t.i.) kernel is of the form K(x, x') = \varphi(x - x')

Bochner's theorem
For a continuous function \varphi: \mathbb{R}^d \to \mathbb{R}, K is p.d. if and only if \varphi is the Fourier-Stieltjes transform of a symmetric and positive finite Borel measure \mu \in \mathcal{M}(\mathbb{R}^d):
\varphi(x) = \int_{\mathbb{R}^d} e^{-i\omega^\top x} d\mu(\omega)

Just sample \omega_i \sim \frac{d\mu(\omega)}{\mu(\mathbb{R}^d)} and b_i \sim U([0, 2\pi]) to approximate any t.i. kernel K with random features
\sqrt{\frac{2}{k}} \cos\left(\omega_i^\top x + b_i\right)
Examples

K(x, x') = \varphi(x - x') = \int_{\mathbb{R}^d} e^{-i\omega^\top(x - x')} d\mu(\omega)

Kernel      \varphi(x)                               \mu(d\omega)
Gaussian    \exp\left(-\frac{\|x\|^2}{2}\right)      (2\pi)^{-d/2} \exp\left(-\frac{\|\omega\|^2}{2}\right)
Laplace     \exp(-\|x\|_1)                           \prod_{i=1}^d \frac{1}{\pi(1 + \omega_i^2)}
Cauchy      \prod_{i=1}^d \frac{2}{1 + x_i^2}        e^{-\|\omega\|_1}
Performance [Rahimi and Recht, 2008]