Random Fourier Features for Kernel Ridge Regression
Michael Kapralov (EPFL)
(Joint work with H. Avron, C. Musco, C. Musco, A. Velingker and A. Zandieh)
Scalable machine learning algorithms with provable guarantees
In this talk: towards scalable numerical linear algebra in kernel spaces, with provable guarantees
Linear regression
Input:
• a sequence of d-dimensional data points x_1, ..., x_n ∈ R^d
• values y_j = f(x_j), j = 1, ..., n
Output: linear approximation to f
Solve the least squares problem:
    min_{α ∈ R^d} ∑_{j=1}^n |x_j^T α − y_j|^2 + λ ||α||_2^2
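As a concrete reference point, here is a minimal NumPy sketch of this linear ridge regression step; the data matrix X (rows x_j), targets y, regularization lam, and the function name are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def linear_ridge(X, y, lam):
    """Solve min_alpha ||X alpha - y||^2 + lam * ||alpha||_2^2 in closed form."""
    d = X.shape[1]
    # Regularized normal equations: (X^T X + lam * I) alpha = X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```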
Kernel ridge regression
Input:
• a sequence of d-dimensional data points x_1, ..., x_n ∈ R^d
• values y_j = f(x_j), j = 1, ..., n
Output: approximation from a class of 'smooth' functions on R^d
[Figure: data points together with the true function]
Choose an embedding into a high dimensional feature space
    Ψ : R → R^D
Dimension D may be infinite (e.g. Gaussian kernel).
Solve the least squares problem:
    min_{α ∈ R^D} ∑_{j=1}^n |Ψ(x_j) α − y_j|^2 + λ ||α||_2^2
[Figure: data points]
Choose an embedding into a high dimensional feature space
    Ψ : x ↦ (1 / (2π)^{1/4}) e^{-(· − x)^2 / 4}
[Figure: Gaussian bumps centered at the data points x_1, ..., x_10, added one at a time]
Solve the least squares problem:
    min_{α ∈ R^D} ∑_{j=1}^n |Ψ(x_j) α − y_j|^2 + λ ||α||_2^2
Solve the least squares problem:
    min_{α ∈ R^D} ∑_{j=1}^n |Ψ(x_j) α − y_j|^2 + λ ||α||_2^2
[Figure: true function, estimator, and data points]
After algebraic manipulations,
    α* = Ψ^T (K + λI)^{-1} y,
where Ψ is the matrix whose j-th row is Ψ(x_j) and K = Ψ Ψ^T is the kernel matrix.
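To make the closed form concrete, here is a hedged sketch of kernel ridge regression with the Gaussian kernel K_ij = e^{-(x_i - x_j)^2/2} used later in the talk; it works with the dual coefficients β = (K + λI)^{-1} y, and all function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x1, x2):
    """K_ij = exp(-(x1_i - x2_j)^2 / 2) for one-dimensional inputs."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2)

def krr_fit(x, y, lam):
    """Dual coefficients beta = (K + lam I)^{-1} y (so that alpha* = Psi^T beta)."""
    K = gaussian_kernel(x, x)
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def krr_predict(x_train, beta, x_test):
    """Estimator f(x) = sum_j beta_j * k(x, x_j)."""
    return gaussian_kernel(x_test, x_train) @ beta
```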
Kernel ridge regression
Main computational effort: (K + λI)^{-1} y
[Figure: the n × n kernel matrix K written as Ψ Ψ^T, where the j-th row of the n × ∞ matrix Ψ is the feature vector Ψ(x_j)]
The (i, j)-th entry of the Gaussian kernel matrix K is K_ij = e^{-(x_i − x_j)^2 / 2}
How quickly can we compute (K + λI)^{-1} y?
The (i, j)-th entry of the Gaussian kernel matrix K is K_ij = e^{-(x_i − x_j)^2 / 2}
• n^3 (or n^ω) time in full generality...
• Ω(n^2) time needed when λ = 0, assuming SETH [Backurs-Indyk-Schmidt, NIPS'17]
In practice: find Z ∈ R^{n×s}, s ≪ n, such that K ≈ ZZ^T, and use ZZ^T + λI as a proxy for K + λI!
Can compute (ZZ^T + λI)^{-1} y in O(ns^2) time and O(ns) space (see the sketch below)!
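One standard way to get the O(ns^2) time and O(ns) space bound, offered here as a hedged sketch rather than the speaker's implementation, is the Woodbury identity (ZZ^T + λI)^{-1} y = (y − Z(λI_s + Z^T Z)^{-1} Z^T y) / λ, which only ever forms an s × s system.

```python
import numpy as np

def low_rank_ridge_solve(Z, y, lam):
    """Return (Z Z^T + lam I)^{-1} y in O(n s^2) time via the Woodbury identity."""
    n, s = Z.shape
    # Small s x s system: (lam I_s + Z^T Z) w = Z^T y
    w = np.linalg.solve(lam * np.eye(s) + Z.T @ Z, Z.T @ y)
    return (y - Z @ w) / lam
```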
Fourier Features
Theorem (Bochner's Theorem). A normalized continuous function k : R → R is a shift-invariant kernel if and only if its Fourier transform k̂ is a probability measure.
Let p(η) := k̂(η). Then for every x_a, x_b,
    K_ab = k(x_a − x_b) = ∫_R k̂(η) e^{-2πi(x_a − x_b)η} dη
                        = ∫_R e^{-2πi(x_a − x_b)η} p(η) dη
                        = E_{η ∼ p(η)} [e^{-2πi(x_a − x_b)η}]
Fourier Features
[Figure: K = A A^T, where A is an n × ∞ matrix whose column indexed by frequency η has entries √p(η) · e^{-2πi x_j η}, j = 1, ..., n]
Rahimi-Recht'2007: fix s, sample i.i.d. η_1, ..., η_s ∼ p(η).
Let the j-th row of Z be Z_{j,k} := (1/√s) e^{-2πi x_j η_k} (samples of the pure frequency x_j), and use ZZ^T as a proxy for K!
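A hedged sketch of this construction for the Gaussian kernel of the slides, K_ij = e^{-(x_i − x_j)^2/2}: under the e^{-2πiδη} Fourier convention, p(η) is the N(0, 1/(4π^2)) density, so frequencies can be drawn as standard Gaussians divided by 2π. The features are kept complex, with ZZ* playing the role of ZZ^T; the function name and the spot-check at the end are illustrative assumptions.

```python
import numpy as np

def gaussian_rff(x, s, seed=0):
    """Random Fourier features Z for k(a, b) = exp(-(a - b)^2 / 2), 1-d data.

    Row j holds (1/sqrt(s)) * exp(-2 pi i x_j eta_k) for sampled frequencies eta_k,
    so that E[Z Z^*] = K.
    """
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal(s) / (2 * np.pi)               # eta_1, ..., eta_s ~ p(eta)
    return np.exp(-2j * np.pi * np.outer(x, eta)) / np.sqrt(s)

# Spot-check the approximation against the exact kernel matrix.
x = np.random.default_rng(1).uniform(-1, 1, 200)
Z = gaussian_rff(x, s=4000)
K_exact = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
print(np.max(np.abs((Z @ Z.conj().T).real - K_exact)))       # small for large s
```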
Fourier Features: sampling columns of Fourier factorization of K
[Figure: the factorization K = A A^T with A of size n × ∞; keeping s sampled, appropriately rescaled columns gives an n × s matrix Z with ZZ^T ≈ K]
Column η has squared ℓ_2 norm n · p(η)!
Fourier features = sampling columns of A with probability proportional to squared column norms!
One has E[ZZ^T] = K
Spectral approximations
[Figure: the factorization K = A A^T, with columns of A indexed by frequencies η and entries √p(η) e^{-2πi x_j η}]
Our goal: find Z ∈ R^{n×s}, s ≪ n, such that
    (1 − ε)(K + λI) ⪯ ZZ^T + λI ⪯ (1 + ε)(K + λI)?
Subspace embeddings for kernel matrices that can be applied implicitly to points x_1, ..., x_n ∈ R^d?
Known for the polynomial kernel only: Avron et al., NIPS'2014, via TensorSketch
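For intuition on what the guarantee asks for: it is equivalent to all eigenvalues of (K + λI)^{-1/2} (ZZ^T + λI) (K + λI)^{-1/2} lying in [1 − ε, 1 + ε]. A hedged sketch of that check (dense and O(n^3), so only for validating small instances; the function name is an illustrative assumption):

```python
import numpy as np

def spectral_error(K, Z, lam):
    """Smallest eps with (1-eps)(K + lam I) <= Z Z^T + lam I <= (1+eps)(K + lam I)."""
    n = K.shape[0]
    # Whiten by (K + lam I)^{-1/2}, computed from an eigendecomposition of the PSD matrix.
    vals, vecs = np.linalg.eigh(K + lam * np.eye(n))
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    eigs = np.linalg.eigvalsh(inv_sqrt @ (Z @ Z.T + lam * np.eye(n)) @ inv_sqrt)
    return max(1.0 - eigs.min(), eigs.max() - 1.0)
```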
Spectral approximation via column sampling
[Figure: K = A A^T with A an n × D matrix]
For each j = 1, ..., D compute a sampling probability τ(j).
Sample s columns independently from the distribution τ; if column j is sampled, include it in Z scaled by 1/√(s · τ(j)).
That way E[ZZ^T] = K (a sketch of this estimator follows below).
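A hedged sketch of this generic column-sampling estimator for a finite n × D factor A and given probabilities τ (the function name and arguments are illustrative assumptions):

```python
import numpy as np

def sample_columns(A, tau, s, seed=0):
    """Return Z holding s sampled, rescaled columns of A, so that E[Z Z^T] = A A^T."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(A.shape[1], size=s, replace=True, p=tau)   # i.i.d. column indices
    return A[:, idx] / np.sqrt(s * tau[idx])                    # rescale by 1/sqrt(s * tau(j))
```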