k-means example: Iris dataset

[Figure: the Iris data projected on the first two principal components (PC1 vs PC2); successive slides show the points colored by the k-means clusters for k = 2, 3, 4 and 5.]

> irisCluster <- kmeans(log(iris[, 1:4]), 3, nstart = 20)
> table(irisCluster$cluster, iris$Species)

    setosa versicolor virginica
  1      0         48         4
  2     50          0         0
  3      0          2        46
k-means complexity

Each update step: O(nd)
Each assignment step: O(ndk)
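As an illustration of these costs, here is a minimal R sketch of Lloyd's algorithm; the function name lloyd and its defaults are our own choices for this example, not the code behind the slides, but the assignment and update steps and their costs appear explicitly.

lloyd <- function(X, k, iters = 20) {
  # X: n x d numeric matrix, k: number of clusters
  n <- nrow(X)
  centers <- X[sample(n, k), , drop = FALSE]           # random initialization
  for (it in 1:iters) {
    # Assignment step, O(ndk): distance of every point to every center
    d2 <- sapply(1:k, function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    cl <- max.col(-d2)                                 # index of the closest center
    # Update step, O(nd): mean of the points assigned to each cluster
    for (j in 1:k) {
      if (any(cl == j))
        centers[j, ] <- colMeans(X[cl == j, , drop = FALSE])
    }
  }
  list(cluster = cl, centers = centers)
}

res <- lloyd(log(as.matrix(iris[, 1:4])), k = 3)       # same data as the example above
table(res$cluster, iris$Species)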
Outline
1 Introduction
2 Standard machine learning
    Dimension reduction: PCA
    Clustering: k-means
    Regression: ridge regression
    Classification: kNN, logistic regression and SVM
    Nonlinear models: kernel methods
3 Large-scale machine learning
4 Conclusion
Motivation

[Figure: scatter plot of y against x for a noisy, roughly linear relationship.]

Predict a continuous output from an input
Model

Training set S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}

Fit a linear function: f_\beta(x) = \beta^\top x

Goodness of fit measured by the residual sum of squares:
RSS(\beta) = \sum_{i=1}^n (y_i - f_\beta(x_i))^2

Ridge regression minimizes the regularized RSS:
\min_\beta RSS(\beta) + \lambda \sum_{i=1}^d \beta_i^2

Solution (set gradient to 0):
\hat\beta = \left(X^\top X + \lambda I\right)^{-1} X^\top Y
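A minimal R sketch of the closed-form solution above (the helper name ridge_fit and the simulated data are ours, for illustration only):

ridge_fit <- function(X, y, lambda) {
  # beta_hat = (X'X + lambda I)^{-1} X'y; solve() avoids forming the inverse explicitly
  d <- ncol(X)
  solve(crossprod(X) + lambda * diag(d), crossprod(X, y))
}

set.seed(1)
n <- 100; d <- 5
X <- matrix(rnorm(n * d), n, d)
y <- X %*% rnorm(d) + rnorm(n)
beta_hat <- ridge_fit(X, y, lambda = 0.1)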
Ridge regression complexity

Compute X^\top X: O(nd^2)
Invert \left(X^\top X + \lambda I\right): O(d^3)

Computing X^\top X is more expensive than inverting it!
Outline
1 Introduction
2 Standard machine learning
    Dimension reduction: PCA
    Clustering: k-means
    Regression: ridge regression
    Classification: kNN, logistic regression and SVM
    Nonlinear models: kernel methods
3 Large-scale machine learning
4 Conclusion
Motivation

Predict the category of a data point
2 or more (sometimes many) categories
k-nearest neighbors (kNN)

[Figure: two-class training points and a kNN decision boundary. (Hastie et al. The Elements of Statistical Learning. Springer, 2001.)]

Training set S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times \{-1, 1\}
No training
Given a new point x \in \mathbb{R}^d, predict the majority class among its k nearest neighbors (take k odd)
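A minimal R sketch of this prediction rule (the helper knn_predict is ours, for illustration only; Xtrain is an n x d matrix and the labels are in {-1, +1}):

knn_predict <- function(Xtrain, ytrain, xnew, k = 5) {
  d2 <- rowSums(sweep(Xtrain, 2, xnew)^2)   # O(nd): squared distances to all training points
  nn <- order(d2)[1:k]                      # indices of the k nearest neighbors
  sign(sum(ytrain[nn]))                     # majority vote (k odd, so no ties)
}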
kNN properties

Uniform Bayes consistency [Stone, 1977]
    Take k = \sqrt{n} (for example)
    Let P be any distribution over (X, Y) pairs
    Assume training data are random pairs sampled i.i.d. according to P
    Then the kNN classifier \hat{f}_n satisfies, almost surely:
    \lim_{n \to +\infty} P(\hat{f}_n(X) \neq Y) = \inf_{f \text{ measurable}} P(f(X) \neq Y)

Complexity:
    Memory: store X, O(nd)
    Training time: 0
    Prediction: O(nd) for each test point
Linear models for classification

Training set S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathbb{R}^d \times \{-1, 1\}
Fit a linear function f_\beta(x) = \beta^\top x
The prediction on a new point x \in \mathbb{R}^d is:
    +1 if f_\beta(x) > 0, -1 otherwise.
Large-margin classifiers

For any f: \mathbb{R}^d \to \mathbb{R}, the margin of f on an (x, y) pair is y f(x)
Large-margin classifiers fit a classifier by maximizing the margins on the training set:
\min_\beta \sum_{i=1}^n \ell(y_i f_\beta(x_i)) + \lambda \beta^\top \beta
for a convex, non-increasing loss function \ell: \mathbb{R} \to \mathbb{R}_+
Loss function examples

Loss       \ell(u)               Method
0-1        1(u \leq 0)           none
Hinge      \max(1 - u, 0)        Support vector machine (SVM)
Logistic   \log(1 + e^{-u})      Logistic regression
Square     (1 - u)^2             Ridge regression
Ridge logistic regression [Le Cessie and van Houwelingen, 1992]

\min_{\beta \in \mathbb{R}^p} J(\beta) = \sum_{i=1}^n \ln\left(1 + e^{-y_i \beta^\top x_i}\right) + \lambda \beta^\top \beta

Can be interpreted as a regularized conditional maximum likelihood estimator
No explicit solution, but a smooth convex optimization problem that can be solved numerically by Newton-Raphson iterations:
\beta^{new} \leftarrow \beta^{old} - \left(\nabla^2_\beta J\left(\beta^{old}\right)\right)^{-1} \nabla_\beta J\left(\beta^{old}\right)
Each iteration amounts to solving a weighted ridge regression problem, hence the name iteratively reweighted least squares (IRLS).
Complexity: O(\text{iterations} \times (nd^2 + d^3))
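A possible R sketch of the Newton-Raphson / IRLS iterations for this objective (our own illustrative implementation, assuming labels y_i in {-1, +1}; not reference code):

sigmoid <- function(u) 1 / (1 + exp(-u))

ridge_logistic <- function(X, y, lambda, iters = 25) {
  d <- ncol(X)
  beta <- rep(0, d)
  for (it in 1:iters) {
    m <- drop(X %*% beta) * y                               # margins y_i beta' x_i
    p <- sigmoid(m)
    grad <- -crossprod(X, y * (1 - p)) + 2 * lambda * beta  # gradient, O(nd)
    W <- p * (1 - p)                                        # per-sample weights
    H <- crossprod(X, X * W) + 2 * lambda * diag(d)         # Hessian, O(nd^2)
    beta <- drop(beta - solve(H, grad))                     # Newton step, O(d^3)
  }
  beta
}

The Hessian H is a weighted ridge matrix, which is why each Newton step amounts to a weighted ridge regression, as noted above.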
SVM [Boser et al., 1992]

\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \max\left(0, 1 - y_i \beta^\top x_i\right) + \lambda \beta^\top \beta

A non-smooth convex optimization problem (convex quadratic program)
Equivalent to the dual problem
\max_{\alpha \in \mathbb{R}^n} 2\alpha^\top Y - \alpha^\top X X^\top \alpha \quad \text{s.t.} \quad 0 \leq y_i \alpha_i \leq \frac{1}{2\lambda} \text{ for } i = 1, \ldots, n

The solution \beta^* of the primal is obtained from the solution \alpha^* of the dual:
\beta^* = X^\top \alpha^*, \qquad f_{\beta^*}(x) = (\beta^*)^\top x = (\alpha^*)^\top X x

Training complexity: O(n^2) to store X X^\top, O(n^3) to find \alpha^*
Prediction: O(d) for (\beta^*)^\top x, O(nd) for (\alpha^*)^\top X x
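As a rough numerical illustration of the dual, here is a projected-gradient sketch in R; the objective, the box constraint and the primal recovery follow the formulas above, but plain projected gradient is our own illustrative choice, not a production SVM solver:

svm_dual_pg <- function(X, y, lambda, iters = 2000) {
  n <- nrow(X)
  K <- tcrossprod(X)                                   # K = X X', O(n^2 d) time, O(n^2) memory
  C <- 1 / (2 * lambda)
  eta <- 1 / (2 * max(eigen(K, only.values = TRUE)$values) + 1e-12)   # safe step size
  alpha <- rep(0, n)
  for (it in 1:iters) {
    g <- 2 * y - 2 * drop(K %*% alpha)                 # gradient of 2 alpha'Y - alpha'K alpha
    alpha <- alpha + eta * g                           # ascent step
    alpha <- y * pmin(pmax(y * alpha, 0), C)           # project onto 0 <= y_i alpha_i <= 1/(2 lambda)
  }
  alpha
}

# Primal recovery and prediction: beta_star <- crossprod(X, alpha); f(x) = sum(beta_star * x)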
Outline
1 Introduction
2 Standard machine learning
    Dimension reduction: PCA
    Clustering: k-means
    Regression: ridge regression
    Classification: kNN, logistic regression and SVM
    Nonlinear models: kernel methods
3 Large-scale machine learning
4 Conclusion
Motivation

[Figure: scatter plot of y against x showing a clearly nonlinear relationship.]
Model

Learn a function f: \mathbb{R}^d \to \mathbb{R} of the form
f(x) = \sum_{i=1}^n \alpha_i K(x_i, x)
for a positive definite (p.d.) kernel K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}, such as

Linear:     K(x, x') = x^\top x'
Polynomial: K(x, x') = \left(x^\top x' + c\right)^p
Gaussian:   K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)
Min/max:    K(x, x') = \sum_{i=1}^d \frac{\min(|x_i|, |x'_i|)}{\max(|x_i|, |x'_i|)}
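Written as small R functions on single vectors, these kernels read as follows (a sketch; the min/max kernel assumes non-zero coordinates):

k_linear   <- function(x, xp) sum(x * xp)
k_poly     <- function(x, xp, c = 1, p = 2) (sum(x * xp) + c)^p
k_gaussian <- function(x, xp, sigma = 1) exp(-sum((x - xp)^2) / (2 * sigma^2))
k_minmax   <- function(x, xp) sum(pmin(abs(x), abs(xp)) / pmax(abs(x), abs(xp)))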
Feature space

A function K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R} is a p.d. kernel if and only if there exists a mapping \Phi: \mathbb{R}^d \to \mathbb{R}^D, for some D \in \mathbb{N} \cup \{+\infty\}, such that
\forall x, x' \in \mathbb{R}^d, \quad K(x, x') = \Phi(x)^\top \Phi(x')

f is then a linear function in \mathbb{R}^D:
f(x) = \sum_{i=1}^n \alpha_i K(x_i, x) = \sum_{i=1}^n \alpha_i \Phi(x_i)^\top \Phi(x) = \beta^\top \Phi(x)
for \beta = \sum_{i=1}^n \alpha_i \Phi(x_i).

[Figure: a data set that is not linearly separable in (x_1, x_2) becomes linearly separable after the feature mapping.]
Learning

[Figure: the same feature-map illustration as on the previous slide.]

We can learn f(x) = \sum_{i=1}^n \alpha_i K(x_i, x) by fitting a linear model \beta^\top \Phi(x) in the feature space
Example: ridge regression / logistic regression / SVM
\min_{\beta \in \mathbb{R}^D} \sum_{i=1}^n \ell\left(y_i, \beta^\top \Phi(x_i)\right) + \lambda \beta^\top \beta
But D can be very large, even infinite...
Kernel tricks

K(x, x') = \Phi(x)^\top \Phi(x') can be quick to compute even if D is large (even infinite)
For a set of training samples \{x_1, \ldots, x_n\} \subset \mathbb{R}^d, let K be the n \times n Gram matrix: K_{ij} = K(x_i, x_j)
For \beta = \sum_{i=1}^n \alpha_i \Phi(x_i) we have
\beta^\top \Phi(x_i) = [K\alpha]_i \quad \text{and} \quad \beta^\top \beta = \alpha^\top K \alpha
We can therefore solve the equivalent problem in \alpha \in \mathbb{R}^n:
\min_{\alpha \in \mathbb{R}^n} \sum_{i=1}^n \ell(y_i, [K\alpha]_i) + \lambda \alpha^\top K \alpha
Example: kernel ridge regression (KRR)

\min_{\beta \in \mathbb{R}^D} \sum_{i=1}^n \left(y_i - \beta^\top \Phi(x_i)\right)^2 + \lambda \beta^\top \beta

Solve in \mathbb{R}^D:
\hat\beta = \underbrace{\left(\Phi(X)^\top \Phi(X) + \lambda I\right)^{-1}}_{D \times D} \Phi(X)^\top Y

Solve in \mathbb{R}^n:
\hat\alpha = \underbrace{(K + \lambda I)^{-1}}_{n \times n} Y
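A minimal R sketch of KRR solved in alpha with the Gaussian RBF kernel (the helper names krr_fit and krr_predict are ours, for illustration only):

krr_fit <- function(X, y, lambda, sigma = 1) {
  n <- nrow(X)
  D2 <- as.matrix(dist(X))^2                   # pairwise squared distances, O(n^2 d)
  K <- exp(-D2 / (2 * sigma^2))                # Gram matrix
  alpha <- solve(K + lambda * diag(n), y)      # alpha_hat = (K + lambda I)^{-1} Y, O(n^3)
  list(alpha = alpha, X = X, sigma = sigma)
}

krr_predict <- function(model, xnew) {
  d2 <- rowSums(sweep(model$X, 2, xnew)^2)
  sum(model$alpha * exp(-d2 / (2 * model$sigma^2)))   # f(x) = sum_i alpha_i K(x_i, x)
}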
KRR with Gaussian RBF kernel

\min_{\beta \in \mathbb{R}^D} \sum_{i=1}^n \left(y_i - \beta^\top \Phi(x_i)\right)^2 + \lambda \beta^\top \beta, \qquad K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)

[Figures: the fitted KRR curve on the nonlinear toy data for lambda = 1000, 100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001 and 0.0000001, going from a nearly flat (underfitted) fit to a very wiggly (overfitted) one as lambda decreases.]
Complexity

[Figure: the fitted KRR curve for lambda = 1 on the toy data.]

Compute K: O(dn^2)
Store K: O(n^2)
Solve for \alpha: O(n^{2\sim3})
Compute f(x) for one x: O(nd)

Impractical for n > 10\sim100k
Outline
1 Introduction
2 Standard machine learning
    Dimension reduction: PCA
    Clustering: k-means
    Regression: ridge regression
    Classification: kNN, logistic regression and SVM
    Nonlinear models: kernel methods
3 Large-scale machine learning
    Scalability issues
    The tradeoffs of large-scale learning
    Random projections
    Random features
    Approximate NN
    Shingling, hashing, sketching
4 Conclusion
Outline
1 Introduction
2 Standard machine learning
3 Large-scale machine learning
    Scalability issues
    The tradeoffs of large-scale learning
    Random projections
    Random features
    Approximate NN
    Shingling, hashing, sketching
4 Conclusion
What is "large-scale"?

Data cannot fit in RAM
Algorithm cannot run on a single machine in reasonable time (algorithm-dependent)
Sometimes even O(n) is too large! (e.g., nearest neighbor in a database of O(B+) items)
Many tasks / parameters (e.g., image categorization in O(10M) classes)
Streams of data
Things to worry about

Training time (usually offline)
Memory requirements
Test time

Complexities so far

Method                 Memory     Training time    Test time
PCA                    O(d^2)     O(nd^2)          O(d)
k-means                O(nd)      O(ndk)           O(kd)
Ridge regression       O(d^2)     O(nd^2)          O(d)
kNN                    O(nd)      0                O(nd)
Logistic regression    O(nd)      O(nd^2)          O(d)
SVM, kernel methods    O(n^2)     O(n^3)           O(nd)
Techniques for large-scale machine learning

Good baselines:
    Subsample data and run standard method
    Split and run on several machines (depends on algorithm)
Need to revisit standard algorithms and implementation, taking into account scalability:
    Trade exactness for scalability
    Compress, sketch, hash data in a smart way
Outline
1 Introduction
2 Standard machine learning
3 Large-scale machine learning
    Scalability issues
    The tradeoffs of large-scale learning
    Random projections
    Random features
    Approximate NN
    Shingling, hashing, sketching
4 Conclusion
Motivation

Classical learning theory analyzes the trade-off between:
    approximation error (how well we approximate the true function)
    estimation error (how well we estimate the parameters)

[Figure: illustration of the function class F.]

But reaching the best trade-off for a given n may be impossible with limited computational resources
We should include the computational budget in the trade-off, and see which optimization algorithm gives the best trade-off!
Seminal paper of Bottou and Bousquet [2008]
Classical ERM setting

Goal: learn a function f: \mathbb{R}^d \to \mathcal{Y} (\mathcal{Y} = \mathbb{R} or \{-1, 1\})
P unknown distribution over \mathbb{R}^d \times \mathcal{Y}
Training set: S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\} \subset \mathbb{R}^d \times \mathcal{Y}, i.i.d. following P
Fix a class of functions \mathcal{F} \subset \{f: \mathbb{R}^d \to \mathbb{R}\}
Choose a loss \ell(y, f(x))
Learning by empirical risk minimization:
f_n \in \arg\min_{f \in \mathcal{F}} R_n[f] = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i))
Hope that f_n has a small risk: R[f_n] = \mathbb{E}\,\ell(Y, f_n(X))
Classical ERM setting

The best possible risk is R^* = \min_{f: \mathbb{R}^d \to \mathcal{Y}} R[f]
The best achievable risk over \mathcal{F} is R^*_{\mathcal{F}} = \min_{f \in \mathcal{F}} R[f]
We then have the decomposition
R[f_n] - R^* = \underbrace{R[f_n] - R^*_{\mathcal{F}}}_{\text{estimation error } \epsilon_{est}} + \underbrace{R^*_{\mathcal{F}} - R^*}_{\text{approximation error } \epsilon_{app}}

[Figure: illustration of the function class F.]
Optimization error

Solving the ERM problem may be hard (when n and d are large)
Instead we usually find an approximate solution \tilde{f}_n that satisfies
R_n[\tilde{f}_n] \leq R_n[f_n] + \rho
The excess risk of \tilde{f}_n is then
\epsilon = R[\tilde{f}_n] - R^* = \underbrace{R[\tilde{f}_n] - R[f_n]}_{\text{optimization error } \epsilon_{opt}} + \epsilon_{est} + \epsilon_{app}
A new trade-off

\epsilon = \epsilon_{app} + \epsilon_{est} + \epsilon_{opt}

Problem: choose \mathcal{F}, n, \rho to make \epsilon as small as possible, subject to a limit on n and on the computation time T

Table 1: Typical variations when \mathcal{F}, n, and \rho increase.

                                  F      n      rho
E_app (approximation error)       ↘
E_est (estimation error)          ↗      ↘
E_opt (optimization error)        ···    ···    ↗
T (computation time)              ↗      ↗      ↘

Large-scale or small-scale?
    Small-scale when the constraint on n is active
    Large-scale when the constraint on T is active
Comparing optimization methods

\min_{\beta \in B \subset \mathbb{R}^d} R_n[f_\beta] = \sum_{i=1}^n \ell(y_i, f_\beta(x_i))

Gradient descent (GD):
\beta_{t+1} \leftarrow \beta_t - \eta \frac{\partial R_n(f_{\beta_t})}{\partial \beta}

Second-order gradient descent (2GD), assuming the Hessian H is known:
\beta_{t+1} \leftarrow \beta_t - H^{-1} \frac{\partial R_n(f_{\beta_t})}{\partial \beta}

Stochastic gradient descent (SGD):
\beta_{t+1} \leftarrow \beta_t - \frac{\eta}{t} \frac{\partial \ell(y_t, f_{\beta_t}(x_t))}{\partial \beta}
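A possible R sketch of SGD with step size eta/t on the ridge logistic objective (our own illustrative implementation; splitting the regularizer evenly over the samples is one choice among several):

sgd_logistic <- function(X, y, lambda, eta = 1, epochs = 10) {
  n <- nrow(X); d <- ncol(X)
  beta <- rep(0, d)
  t <- 0
  for (e in 1:epochs) {
    for (i in sample(n)) {                                        # one random pass per epoch
      t <- t + 1
      m <- y[i] * sum(X[i, ] * beta)                              # margin y_i beta' x_i
      g <- -y[i] * X[i, ] / (1 + exp(m)) + 2 * lambda * beta / n  # stochastic gradient
      beta <- beta - (eta / t) * g                                # O(d) per update
    }
  }
  beta
}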
Results [Bottou and Bousquet, 2008]

Algorithm   Cost of one iteration   Iterations to reach \rho                        Time to reach accuracy \rho                      Time to reach E \leq c(E_{app} + \epsilon)
GD          O(nd)                   O(\kappa \log\frac{1}{\rho})                    O(nd\kappa \log\frac{1}{\rho})                   O(\frac{d^2 \kappa}{\epsilon^{1/\alpha}} \log^2\frac{1}{\epsilon})
2GD         O(d^2 + nd)             O(\log\log\frac{1}{\rho})                       O((d^2 + nd) \log\log\frac{1}{\rho})             O(\frac{d^2}{\epsilon^{1/\alpha}} \log\frac{1}{\epsilon} \log\log\frac{1}{\epsilon})
SGD         O(d)                    \frac{\nu\kappa^2}{\rho} + o(\frac{1}{\rho})    O(\frac{d\nu\kappa^2}{\rho})                     O(\frac{d\nu\kappa^2}{\epsilon})
2SGD        O(d^2)                  \frac{\nu}{\rho} + o(\frac{1}{\rho})            O(\frac{d^2\nu}{\rho})                           O(\frac{d^2\nu}{\epsilon})

\alpha \in [1/2, 1] comes from the bound on \epsilon_{est} and depends on the data
In the last column, n and \rho are optimized to reach \epsilon for each method
2GD optimizes much faster than GD, but the gain on the final performance is limited by the \epsilon^{-1/\alpha} factor coming from the estimation error
SGD: optimization speed is catastrophic, but learning speed is the best, and independent of \alpha
This suggests that SGD is very competitive (and has become the de facto standard in large-scale ML)
Illustration

https://bigdata2013.sciencesconf.org/conference/bigdata2013/pages/bottou.pdf
Outline
1 Introduction
2 Standard machine learning
3 Large-scale machine learning
    Scalability issues
    The tradeoffs of large-scale learning
    Random projections
    Random features
    Approximate NN
    Shingling, hashing, sketching
4 Conclusion
Motivation

Affects scalability of algorithms, e.g., O(nd) for kNN or O(d^3) for ridge regression
Hard to visualize
(Sometimes) counterintuitive phenomena in high dimension, e.g., concentration of measure for Gaussian data

[Figure: histograms of \|x\|/\sqrt{d} for standard Gaussian data in dimensions d = 1, 10, 100: the norm concentrates as the dimension grows.]

Statistical inference degrades when d increases (curse of dimension)
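A quick R sketch reproducing the concentration-of-norm figure (illustrative choice of 1000 standard Gaussian samples per dimension):

set.seed(1)
par(mfrow = c(1, 3))
for (d in c(1, 10, 100)) {
  x <- matrix(rnorm(1000 * d), 1000, d)
  hist(sqrt(rowSums(x^2)) / sqrt(d), breaks = 30,
       main = paste("d =", d), xlab = "||x|| / sqrt(d)", xlim = c(0, 3))
}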
Dimension reduction with PCA

[Figure: 2-D point cloud with its first two principal directions PC1 and PC2.]

Projects data onto k < d dimensions that capture the largest amount of variance
Also minimizes the total reconstruction error:
\min_{S_k} \sum_{i=1}^n \|x_i - \Pi_{S_k}(x_i)\|^2
But computationally expensive: O(nd^2)
No theoretical guarantee on distance preservation
Linear dimension reduction

X' = X \times R
(n \times k) = (n \times d)(d \times k)

Can we find R efficiently?
Can we preserve distances?
\forall i, j = 1, \ldots, n, \quad \|f(x_i) - f(x_j)\| \approx \|x_i - x_j\|
Note: when d > n, we can take k = n and preserve all distances exactly (kernel trick)
Random projections

Simply take a random projection matrix:
f(x) = \frac{1}{\sqrt{k}} R^\top x \quad \text{with} \quad R_{ij} \sim \mathcal{N}(0, 1)

Theorem [Johnson and Lindenstrauss, 1984]
For any \epsilon > 0 and n \in \mathbb{N}, take
k \geq 4\left(\epsilon^2/2 - \epsilon^3/3\right)^{-1} \log(n) \approx \epsilon^{-2} \log(n).
Then the following holds with probability at least 1 - 1/n:
(1 - \epsilon)\|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1 + \epsilon)\|x_i - x_j\|^2 \quad \forall i, j = 1, \ldots, n

k does not depend on d!
n = 1M, \epsilon = 0.1 \Rightarrow k \approx 5K
n = 1B, \epsilon = 0.1 \Rightarrow k \approx 8K
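An empirical R check of the theorem (the sizes n, d and epsilon are illustrative; k follows the bound above):

set.seed(1)
n <- 100; d <- 5000; eps <- 0.3
k <- ceiling(4 * log(n) / (eps^2 / 2 - eps^3 / 3))
X <- matrix(rnorm(n * d), n, d)
R <- matrix(rnorm(d * k), d, k)
Xp <- X %*% R / sqrt(k)                                    # f(x) = R'x / sqrt(k), applied row-wise
ratio <- as.vector(dist(Xp))^2 / as.vector(dist(X))^2      # distortion of all pairwise squared distances
range(ratio)                                               # should lie within [1 - eps, 1 + eps] w.h.p.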
Proof (1/3)

For a single dimension, q_j = r_j^\top u:
\mathbb{E}(q_j) = \mathbb{E}(r_j)^\top u = 0
\mathbb{E}(q_j^2) = u^\top \mathbb{E}(r_j r_j^\top) u = \|u\|^2

For the k-dimensional projection f(u) = \frac{1}{\sqrt{k}} R^\top u:
\|f(u)\|^2 = \frac{1}{k} \sum_{j=1}^k q_j^2 \sim \frac{\|u\|^2}{k} \chi^2(k)
\mathbb{E}\|f(u)\|^2 = \frac{1}{k} \sum_{j=1}^k \mathbb{E}(q_j^2) = \|u\|^2

Need to show that \|f(u)\|^2 is concentrated around its mean
Proof (2/3)

P\left(\|f\|^2 > (1 + \epsilon)\|u\|^2\right)
= P\left(\chi^2(k) > (1 + \epsilon)k\right)
= P\left(e^{\lambda \chi^2(k)} > e^{\lambda(1+\epsilon)k}\right)
\leq \mathbb{E}\left(e^{\lambda \chi^2(k)}\right) e^{-\lambda(1+\epsilon)k}   (Markov)
= (1 - 2\lambda)^{-k/2} e^{-\lambda(1+\epsilon)k}   (MGF of \chi^2(k) for 0 \leq \lambda \leq 1/2)
= \left((1 + \epsilon)e^{-\epsilon}\right)^{k/2}   (take \lambda = \epsilon / (2(1+\epsilon)))
\leq e^{-(\epsilon^2/2 - \epsilon^3/3)k/2}   (use \log(1+x) \leq x - x^2/2 + x^3/3)
= n^{-2}   (take k = 4\log(n) / (\epsilon^2/2 - \epsilon^3/3))

Similarly we get P\left(\|f\|^2 < (1 - \epsilon)\|u\|^2\right) < n^{-2}
Proof (3/3)

Apply with u = x_i - x_j and use linearity of f to show that, for an (x_i, x_j) pair, the probability of a large distortion is \leq 2n^{-2}

Union bound: over all n(n-1)/2 pairs, the probability that at least one has a large distortion is smaller than
\frac{n(n-1)}{2} \times \frac{2}{n^2} = 1 - \frac{1}{n}
Scalability

n = O(1B); d = O(1M) \Rightarrow k = O(10K)
Memory: need to store R, O(dk) \approx 40 GB
Computation: X \times R in O(ndk)

Other random matrices R have similar properties but better scalability, e.g.:
"Add or subtract" [Achlioptas, 2003], 1 bit/entry, size \approx 1.25 GB:
R_{ij} = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases}
Fast Johnson-Lindenstrauss transform [Ailon and Chazelle, 2009], where R = PHD; compute f(x) in O(d \log d)
Outline
1 Introduction
2 Standard machine learning
3 Large-scale machine learning
    Scalability issues
    The tradeoffs of large-scale learning
    Random projections
    Random features
    Approximate NN
    Shingling, hashing, sketching
4 Conclusion
Motivation

[Diagram: the JL random projection maps R^d to R^k with a random matrix R, while a kernel feature map Phi maps R^d to a high-dimensional R^D.]

Random features?
Fourier feature space

Example: the Gaussian kernel

e^{-\frac{\|x - x'\|^2}{2}} = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{i\omega^\top(x - x')} e^{-\frac{\|\omega\|^2}{2}} d\omega
= \mathbb{E}_\omega\left[\cos\left(\omega^\top(x - x')\right)\right]
= \mathbb{E}_{\omega, b}\left[2\cos\left(\omega^\top x + b\right)\cos\left(\omega^\top x' + b\right)\right]
with
\omega \sim p(d\omega) = \frac{1}{(2\pi)^{d/2}} e^{-\frac{\|\omega\|^2}{2}} d\omega, \qquad b \sim U([0, 2\pi]).

This is of the form K(x, x') = \Phi(x)^\top \Phi(x') with D = +\infty:
\Phi: \mathbb{R}^d \to L^2\left(\left(\mathbb{R}^d, p(d\omega)\right) \times ([0, 2\pi], U)\right)
Random Fourier features [Rahimi and Recht, 2008]

[Diagram: the random feature matrix is built by multiplying the data by random frequencies \omega_j, adding random phases b_j, and taking cosines.]

For i = 1, \ldots, k, sample randomly:
(\omega_i, b_i) \sim p(d\omega) \times U([0, 2\pi])

Create random features:
\forall x \in \mathbb{R}^d, \quad f_i(x) = \sqrt{\frac{2}{k}} \cos\left(\omega_i^\top x + b_i\right)
Random Fourier features [Rahimi and Recht, 2008]

For any x, x' \in \mathbb{R}^d, it holds that
\mathbb{E}\left[f(x)^\top f(x')\right] = \sum_{i=1}^k \mathbb{E}\left[f_i(x) f_i(x')\right]
= \frac{1}{k} \sum_{i=1}^k \mathbb{E}\left[2\cos\left(\omega^\top x + b\right)\cos\left(\omega^\top x' + b\right)\right]
= K(x, x')

and by Hoeffding's inequality,
P\left(\left|f(x)^\top f(x') - K(x, x')\right| > \epsilon\right) \leq 2e^{-k\epsilon^2/2}

This allows us to approximate learning with the Gaussian kernel by a simple linear model in k dimensions!
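A minimal R sketch of random Fourier features for the Gaussian kernel with sigma = 1 (illustrative sizes; omega is drawn from N(0, I), matching the measure p(d omega) above):

set.seed(1)
d <- 10; k <- 2000
W <- matrix(rnorm(d * k), d, k)                 # columns are the omega_i
b <- runif(k, 0, 2 * pi)
rff <- function(X, W, b)                        # maps each row of X to its k random features
  sqrt(2 / ncol(W)) * cos(X %*% W + matrix(b, nrow(X), ncol(W), byrow = TRUE))

x <- rnorm(d); xp <- rnorm(d)
Z <- rff(rbind(x, xp), W, b)
sum(Z[1, ] * Z[2, ])                            # approximate kernel value f(x)'f(x')
exp(-sum((x - xp)^2) / 2)                       # exact Gaussian kernel value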
Generalization

A translation-invariant (t.i.) kernel is of the form K(x, x') = \varphi(x - x')

Bochner's theorem
For a continuous function \varphi: \mathbb{R}^d \to \mathbb{R}, K is p.d. if and only if \varphi is the Fourier-Stieltjes transform of a symmetric and positive finite Borel measure \mu \in \mathcal{M}(\mathbb{R}^d):
\varphi(x) = \int_{\mathbb{R}^d} e^{-i\omega^\top x} d\mu(\omega)

Just sample \omega_i \sim \frac{d\mu(\omega)}{\mu(\mathbb{R}^d)} and b_i \sim U([0, 2\pi]) to approximate any t.i. kernel K with random features
\sqrt{\frac{2}{k}} \cos\left(\omega_i^\top x + b_i\right)
Examples

K(x, x') = \varphi(x - x') = \int_{\mathbb{R}^d} e^{-i\omega^\top(x - x')} d\mu(\omega)

Kernel      \varphi(x)                               \mu(d\omega)
Gaussian    \exp\left(-\frac{\|x\|^2}{2}\right)      (2\pi)^{-d/2} \exp\left(-\frac{\|\omega\|^2}{2}\right)
Laplace     \exp(-\|x\|_1)                           \prod_{i=1}^d \frac{1}{\pi(1 + \omega_i^2)}
Cauchy      \prod_{i=1}^d \frac{2}{1 + x_i^2}        e^{-\|\omega\|_1}
Performance [Rahimi and Recht, 2008]