Random Fourier Features for Kernel Ridge Regression
Michael Kapralov (EPFL)
(Joint work with H. Avron, C. Musco, C. Musco, A. Velingker and A. Zandieh)
Scalable machine learning algorithms with provable guarantees
In this talk: towards scalable numerical linear algebra in kernel spaces, with provable guarantees
Linear regression
Input:
• a sequence of d-dimensional data points x_1, ..., x_n ∈ R^d
• values y_j = f(x_j), j = 1, ..., n
Output: linear approximation to f
Solve the least squares problem:
    min_{α ∈ R^d} ∑_{j=1}^n |x_j^T α − y_j|^2 + λ ||α||_2^2
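As a concrete reference point, here is a minimal NumPy sketch of this linear ridge regression step; the data matrix X (rows x_j), targets y, regularization lam, and the function name are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def linear_ridge(X, y, lam):
    """Solve min_alpha ||X alpha - y||^2 + lam * ||alpha||_2^2 in closed form."""
    d = X.shape[1]
    # Regularized normal equations: (X^T X + lam * I) alpha = X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```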
Kernel ridge regression
Input:
• a sequence of d-dimensional data points x_1, ..., x_n ∈ R^d
• values y_j = f(x_j), j = 1, ..., n
Output: approximation from a class of 'smooth' functions on R^d
[Figure: data points together with the true function]
Choose an embedding into a high dimensional feature space
    Ψ : R → R^D
Dimension D may be infinite (e.g. Gaussian kernel).
Solve the least squares problem:
    min_{α ∈ R^D} ∑_{j=1}^n |Ψ(x_j) α − y_j|^2 + λ ||α||_2^2
[Figure: data points]
Choose an embedding into a high dimensional feature space
    Ψ : x ↦ (1 / (2π)^{1/4}) e^{-(· − x)^2 / 4}
[Figure: Gaussian bumps centered at the data points x_1, ..., x_10, added one at a time]
Solve the least squares problem:
    min_{α ∈ R^D} ∑_{j=1}^n |Ψ(x_j) α − y_j|^2 + λ ||α||_2^2
Solve the least squares problem:
    min_{α ∈ R^D} ∑_{j=1}^n |Ψ(x_j) α − y_j|^2 + λ ||α||_2^2
[Figure: true function, estimator, and data points]
After algebraic manipulations,
    α* = Ψ^T (K + λI)^{-1} y,
where Ψ is the matrix whose j-th row is Ψ(x_j) and K = Ψ Ψ^T is the kernel matrix.
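To make the closed form concrete, here is a hedged sketch of kernel ridge regression with the Gaussian kernel K_ij = e^{-(x_i - x_j)^2/2} used later in the talk; it works with the dual coefficients β = (K + λI)^{-1} y, and all function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x1, x2):
    """K_ij = exp(-(x1_i - x2_j)^2 / 2) for one-dimensional inputs."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2)

def krr_fit(x, y, lam):
    """Dual coefficients beta = (K + lam I)^{-1} y (so that alpha* = Psi^T beta)."""
    K = gaussian_kernel(x, x)
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def krr_predict(x_train, beta, x_test):
    """Estimator f(x) = sum_j beta_j * k(x, x_j)."""
    return gaussian_kernel(x_test, x_train) @ beta
```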
Kernel ridge regression
Main computational effort: (K + λI)^{-1} y
[Figure: the n × n kernel matrix K written as Ψ Ψ^T, where the j-th row of the n × ∞ matrix Ψ is the feature vector Ψ(x_j)]
The (i, j)-th entry of the Gaussian kernel matrix K is K_ij = e^{-(x_i − x_j)^2 / 2}
How quickly can we compute (K + λI)^{-1} y?
The (i, j)-th entry of the Gaussian kernel matrix K is K_ij = e^{-(x_i − x_j)^2 / 2}
• n^3 (or n^ω) time in full generality...
• Ω(n^2) time needed when λ = 0, assuming SETH [Backurs-Indyk-Schmidt, NIPS'17]
In practice: find Z ∈ R^{n×s}, s ≪ n, such that K ≈ ZZ^T, and use ZZ^T + λI as a proxy for K + λI!
Can compute (ZZ^T + λI)^{-1} y in O(ns^2) time and O(ns) space (see the sketch below)!
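One standard way to get the O(ns^2) time and O(ns) space bound, offered here as a hedged sketch rather than the speaker's implementation, is the Woodbury identity (ZZ^T + λI)^{-1} y = (y − Z(λI_s + Z^T Z)^{-1} Z^T y) / λ, which only ever forms an s × s system.

```python
import numpy as np

def low_rank_ridge_solve(Z, y, lam):
    """Return (Z Z^T + lam I)^{-1} y in O(n s^2) time via the Woodbury identity."""
    n, s = Z.shape
    # Small s x s system: (lam I_s + Z^T Z) w = Z^T y
    w = np.linalg.solve(lam * np.eye(s) + Z.T @ Z, Z.T @ y)
    return (y - Z @ w) / lam
```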
Fourier Features
Theorem (Bochner's Theorem). A normalized continuous function k : R → R is a shift-invariant kernel if and only if its Fourier transform k̂ is a probability measure.
Let p(η) := k̂(η). Then for every x_a, x_b,
    K_ab = k(x_a − x_b) = ∫_R k̂(η) e^{-2πi(x_a − x_b)η} dη
                        = ∫_R e^{-2πi(x_a − x_b)η} p(η) dη
                        = E_{η ∼ p(η)} [e^{-2πi(x_a − x_b)η}]
Fourier Features
[Figure: K = A A^T, where A is an n × ∞ matrix whose column indexed by frequency η has entries √p(η) · e^{-2πi x_j η}, j = 1, ..., n]
Rahimi-Recht'2007: fix s, sample i.i.d. η_1, ..., η_s ∼ p(η).
Let the j-th row of Z be Z_{j,k} := (1/√s) e^{-2πi x_j η_k} (samples of the pure frequency x_j), and use ZZ^T as a proxy for K!
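A hedged sketch of this construction for the Gaussian kernel of the slides, K_ij = e^{-(x_i − x_j)^2/2}: under the e^{-2πiδη} Fourier convention, p(η) is the N(0, 1/(4π^2)) density, so frequencies can be drawn as standard Gaussians divided by 2π. The features are kept complex, with ZZ* playing the role of ZZ^T; the function name and the spot-check at the end are illustrative assumptions.

```python
import numpy as np

def gaussian_rff(x, s, seed=0):
    """Random Fourier features Z for k(a, b) = exp(-(a - b)^2 / 2), 1-d data.

    Row j holds (1/sqrt(s)) * exp(-2 pi i x_j eta_k) for sampled frequencies eta_k,
    so that E[Z Z^*] = K.
    """
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal(s) / (2 * np.pi)               # eta_1, ..., eta_s ~ p(eta)
    return np.exp(-2j * np.pi * np.outer(x, eta)) / np.sqrt(s)

# Spot-check the approximation against the exact kernel matrix.
x = np.random.default_rng(1).uniform(-1, 1, 200)
Z = gaussian_rff(x, s=4000)
K_exact = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
print(np.max(np.abs((Z @ Z.conj().T).real - K_exact)))       # small for large s
```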
Fourier Features: sampling columns of Fourier factorization of K
[Figure: the factorization K = A A^T with A of size n × ∞; keeping s sampled, appropriately rescaled columns gives an n × s matrix Z with ZZ^T ≈ K]
Column η has squared ℓ_2 norm n · p(η)!
Fourier features = sampling columns of A with probability proportional to squared column norms!
One has E[ZZ^T] = K
Spectral approximations
[Figure: the factorization K = A A^T, with columns of A indexed by frequencies η and entries √p(η) e^{-2πi x_j η}]
Our goal: find Z ∈ R^{n×s}, s ≪ n, such that
    (1 − ε)(K + λI) ⪯ ZZ^T + λI ⪯ (1 + ε)(K + λI)?
Subspace embeddings for kernel matrices that can be applied implicitly to points x_1, ..., x_n ∈ R^d?
Known for the polynomial kernel only: Avron et al., NIPS'2014, via TensorSketch
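For intuition on what the guarantee asks for: it is equivalent to all eigenvalues of (K + λI)^{-1/2} (ZZ^T + λI) (K + λI)^{-1/2} lying in [1 − ε, 1 + ε]. A hedged sketch of that check (dense and O(n^3), so only for validating small instances; the function name is an illustrative assumption):

```python
import numpy as np

def spectral_error(K, Z, lam):
    """Smallest eps with (1-eps)(K + lam I) <= Z Z^T + lam I <= (1+eps)(K + lam I)."""
    n = K.shape[0]
    # Whiten by (K + lam I)^{-1/2}, computed from an eigendecomposition of the PSD matrix.
    vals, vecs = np.linalg.eigh(K + lam * np.eye(n))
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    eigs = np.linalg.eigvalsh(inv_sqrt @ (Z @ Z.T + lam * np.eye(n)) @ inv_sqrt)
    return max(1.0 - eigs.min(), eigs.max() - 1.0)
```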
Spectral approximation via column sampling
[Figure: K = A A^T with A an n × D matrix]
For each j = 1, ..., D compute a sampling probability τ(j).
Sample s columns independently from the distribution τ; if column j is sampled, include it in Z scaled by 1/√(s · τ(j)).
That way E[ZZ^T] = K (a sketch of this estimator follows below).
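A hedged sketch of this generic column-sampling estimator for a finite n × D factor A and given probabilities τ (the function name and arguments are illustrative assumptions):

```python
import numpy as np

def sample_columns(A, tau, s, seed=0):
    """Return Z holding s sampled, rescaled columns of A, so that E[Z Z^T] = A A^T."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(A.shape[1], size=s, replace=True, p=tau)   # i.i.d. column indices
    return A[:, idx] / np.sqrt(s * tau[idx])                    # rescale by 1/sqrt(s * tau(j))
```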