Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels
Jiyan Yang (Stanford University)
Joint work with Vikas Sindhwani, Haim Avron, and Michael Mahoney
ICML 2014, Beijing, June 24th, 2014
Outline: Brief Overview of Kernel Methods · Low-dimensional Explicit Feature Maps · Quasi-Monte Carlo Random Features · Empirical Results
Problem setting
We will start with the kernel ridge regression problem,
\[
\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2 + \lambda \|f\|_{\mathcal{H}}^2, \qquad (1)
\]
where $x_i \in \mathbb{R}^d$ and $\mathcal{H}$ is a nice hypothesis space (an RKHS); the squared loss here can be replaced by any convex loss $\ell$.
◮ A symmetric, positive-definite kernel $k(x, y)$ generates a unique RKHS $\mathcal{H}$.
◮ For example, the RBF kernel $k(x, y) = e^{-\|x - y\|^2 / (2\sigma^2)}$.
◮ Kernel methods are widely used for regression, classification, and inverse problems arising in many areas, as well as for unsupervised learning.
Scalability
◮ By the Representer Theorem, the minimizer of (1) has the form $f(x) = \sum_{i} c_i k(x_i, x)$ with coefficients $c = (K + \lambda n I)^{-1} Y$.
◮ Here the Gram matrix $K$ is defined by $K_{ij} = k(x_i, x_j)$. Forming the $n \times n$ matrix $K$ needs $O(n^2)$ storage, and typical dense linear algebra needs $O(n^3)$ running time (a sketch of this exact solve follows below).
◮ This is an $n \times n$ dense linear system, which is not scalable for large $n$.
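To make the cost concrete, here is a minimal numpy sketch of the exact kernel ridge regression solve with an RBF kernel. The helper names (rbf_kernel, kernel_ridge_fit) and the arguments sigma and lam are illustrative, not from the talk.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), computed for all pairs of rows
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_ridge_fit(X, y, lam, sigma=1.0):
    # c = (K + lambda * n * I)^{-1} y : forming K costs O(n^2) memory,
    # and the dense solve costs O(n^3) time -- the scalability bottleneck.
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)
```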
Linear kernel and explicit feature maps
◮ Suppose we can find a feature map $z : \mathcal{X} \to \mathbb{R}^s$ such that $k(x, y) = z(x)^T z(y)$. Then the Gram matrix is $K = Z Z^T$, where the $i$-th row of $Z \in \mathbb{R}^{n \times s}$ is $z(x_i)$.
◮ The solution to (1) can be expressed as $w = (Z^T Z + \lambda n I)^{-1} Z^T Y$ (see the sketch below).
◮ This is an $s \times s$ linear system.
◮ It is attractive if $s \ll n$.
◮ Test time reduces from $O(nd)$ to $O(s + d)$.
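By contrast, a sketch of the primal solve with an explicit feature matrix Z (rows z(x_i)); again the function names are illustrative and assume numpy.

```python
import numpy as np

def feature_ridge_fit(Z, y, lam):
    # w = (Z^T Z + lambda * n * I)^{-1} Z^T y : an s x s system, cheap when s << n
    n, s = Z.shape
    return np.linalg.solve(Z.T @ Z + lam * n * np.eye(s), Z.T @ y)

def feature_ridge_predict(Z_test, w):
    # prediction reduces to an inner product with the learned weights
    return Z_test @ w
```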
Outline: Brief Overview of Kernel Methods · Low-dimensional Explicit Feature Maps · Quasi-Monte Carlo Random Features · Empirical Results
Mercer's Theorem and explicit feature maps
Theorem (Mercer). Any positive-definite kernel $k$ can be expanded as
\[
k(x, y) = \sum_{i=1}^{N_F} \lambda_i \phi_i(x) \phi_i(y).
\]
◮ We can define $\Phi(x) = \bigl( \sqrt{\lambda_1}\, \phi_1(x), \ldots, \sqrt{\lambda_{N_F}}\, \phi_{N_F}(x) \bigr)$.
◮ For many kernels, such as the RBF kernel, $N_F = \infty$.
◮ Goal: find an explicit feature map $z(x) \in \mathbb{R}^s$ such that $k(x, y) \approx z(x)^T z(y)$, where $s < n$. Then $K \approx Z Z^T$.
Bochner's Theorem
Theorem (Bochner). A continuous kernel $k(x, y) = k(x - y)$ on $\mathbb{R}^d$ is positive definite if and only if $k(x - y)$ is the Fourier transform of a non-negative measure.
A Monte Carlo Approximation
◮ More specifically, given a shift-invariant kernel $k$, we have
\[
k(x, y) = k(x - y) = \int_{\mathbb{R}^d} e^{-i w^T (x - y)} p(w) \, dw.
\]
◮ By the standard Monte Carlo (MC) approach, this can be approximated by
\[
\tilde{k}(x, y) = \frac{1}{s} \sum_{j=1}^{s} e^{-i w_j^T (x - y)}, \qquad (2)
\]
where the $w_j$ are drawn from $p(w)$.
Random Fourier features
◮ The random Fourier feature map can be defined as
\[
\psi(x) = \frac{1}{\sqrt{s}} \bigl( g_{w_1}(x), \ldots, g_{w_s}(x) \bigr), \qquad \text{where } g_{w_j}(x) = e^{-i w_j^T x}
\]
[Rahimi and Recht 07].
◮ So
\[
\tilde{k}(x, y) = \frac{1}{s} \sum_{j=1}^{s} e^{-i w_j^T (x - y)} = \psi(x)^T \bar{\psi}(y).
\]
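A minimal sketch of this Monte Carlo feature map for the RBF kernel, whose spectral density $p(w)$ is the Gaussian $N(0, \sigma^{-2} I)$; the function name mc_fourier_features and the rng argument are our own, not from the talk.

```python
import numpy as np

def mc_fourier_features(X, s, sigma=1.0, rng=None):
    # Draw s frequencies w_j i.i.d. from p(w) = N(0, sigma^{-2} I), then form
    # psi(x) = (1/sqrt(s)) * (exp(-i w_1^T x), ..., exp(-i w_s^T x)) for each row x of X.
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(s, d))
    return np.exp(-1j * (X @ W.T)) / np.sqrt(s)

# The Gram matrix approximation is K_tilde = (Psi @ Psi.conj().T).real,
# since psi(x)^T conj(psi(y)) = (1/s) * sum_j exp(-i w_j^T (x - y)).
```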
Motivation
◮ We want to use fewer random features while maintaining the same approximation accuracy.
◮ The MC method has a convergence rate of $O(1/\sqrt{s})$.
◮ To gain faster convergence, the quasi-Monte Carlo method is a better choice, since it has a convergence rate of $O((\log s)^d / s)$.
Outline: Brief Overview of Kernel Methods · Low-dimensional Explicit Feature Maps · Quasi-Monte Carlo Random Features · Empirical Results
Quasi-Monte Carlo methods
Goal: approximate an integral over the $d$-dimensional unit cube $[0, 1]^d$,
\[
I_d(f) = \int_{[0,1]^d} f(x) \, dx_1 \cdots dx_d.
\]
Quasi-Monte Carlo methods usually take the form
\[
Q_s(f) = \frac{1}{s} \sum_{i=1}^{s} f(t_i),
\]
where $t_1, \ldots, t_s \in [0, 1]^d$ are quasi-random points: a deterministically chosen, low-discrepancy point set.
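For intuition, a tiny sketch comparing MC and QMC estimates of a simple integral over the unit cube, assuming scipy.stats.qmc is available; the test integrand $f(x) = e^{\sum_j x_j}$ has exact integral $(e - 1)^d$ and is our own choice, not from the talk.

```python
import numpy as np
from scipy.stats import qmc

d, s = 5, 1024
f = lambda T: np.exp(T.sum(axis=1))          # test integrand; exact integral is (e - 1)^d
exact = (np.e - 1.0) ** d

mc_pts = np.random.default_rng(0).random((s, d))
qmc_pts = qmc.Sobol(d=d, scramble=True, seed=0).random(s)

print(abs(f(mc_pts).mean() - exact))         # MC error ~ O(1/sqrt(s))
print(abs(f(qmc_pts).mean() - exact))        # QMC error ~ O((log s)^d / s), typically smaller
```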
Low-discrepancy sequences
◮ Many quasi-random sequences $\{t_i\}_{i=1}^{\infty}$ with low discrepancy are available, such as the Halton sequence and the Sobol' sequence.
◮ They tend to be more "uniform" than points drawn uniformly at random.
◮ Notice the clumping and the regions with no points in the left subplot.
[Figure: scatter plots on $[0, 1]^2$ of points drawn uniformly at random (left, "Uniform") versus points of a Halton sequence (right, "Halton").]
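Such sequences are readily available off the shelf; for example, recent SciPy versions (1.7+) provide Halton and Sobol' generators in scipy.stats.qmc. A small sketch, assuming that module:

```python
import numpy as np
from scipy.stats import qmc

d, s = 2, 256
uniform = np.random.default_rng(0).random((s, d))     # i.i.d. uniform points
halton = qmc.Halton(d=d, scramble=False).random(s)    # Halton sequence in [0, 1]^d
sobol = qmc.Sobol(d=d, scramble=True).random(s)       # scrambled Sobol' points

# Lower discrepancy means the points fill the cube more evenly.
print(qmc.discrepancy(uniform), qmc.discrepancy(halton), qmc.discrepancy(sobol))
```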
Quasi-random features
◮ Setting $w = \Phi^{-1}(t)$, where $\Phi^{-1}$ is the (coordinate-wise) inverse CDF associated with $p$, $k(x, y)$ can be rewritten as
\[
\int_{\mathbb{R}^d} e^{-i (x - y)^T w} p(w) \, dw
= \int_{[0,1]^d} e^{-i (x - y)^T \Phi^{-1}(t)} \, dt
\approx \frac{1}{s} \sum_{j=1}^{s} e^{-i (x - y)^T \Phi^{-1}(t_j)}. \qquad (3)
\]
◮ After generating the low-discrepancy sequence $\{t_j\}_{j=1}^{s}$, the quasi-random feature map is built from $g_{t_j}(x) = e^{-i x^T \Phi^{-1}(t_j)}$, scaled by $1/\sqrt{s}$ as before.
Algorithm: Quasi-Random Fourier Features
Input: shift-invariant kernel $k$, feature-map size $s$.
Output: feature map $\hat{\Psi}(x) : \mathbb{R}^d \mapsto \mathbb{C}^s$.
1: Find $p$, the inverse Fourier transform of $k$.
2: Generate a low-discrepancy sequence $t_1, \ldots, t_s$.
3: Transform the sequence: $w_j = \Phi^{-1}(t_j)$.
4: Set $\hat{\Psi}(x) = \frac{1}{\sqrt{s}} \bigl( e^{-i x^T w_1}, \ldots, e^{-i x^T w_s} \bigr)$.
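Below is a minimal end-to-end sketch of this recipe for the RBF kernel: its spectral density factorizes into independent Gaussians, so $\Phi^{-1}$ can be applied coordinate-wise via the normal inverse CDF. We use scrambled sequences (and a small clip) so that no point lands exactly on the boundary of $[0, 1]^d$, where $\Phi^{-1}$ diverges; the function and argument names are ours, not from the paper.

```python
import numpy as np
from scipy.stats import norm, qmc

def qmc_fourier_features(X, s, sigma=1.0, sequence="halton", seed=0):
    d = X.shape[1]
    # Step 2: low-discrepancy sequence t_1, ..., t_s in [0, 1]^d.
    if sequence == "halton":
        gen = qmc.Halton(d=d, scramble=True, seed=seed)
    else:
        gen = qmc.Sobol(d=d, scramble=True, seed=seed)
    T = gen.random(s)
    T = np.clip(T, 1e-12, 1.0 - 1e-12)       # guard against Phi^{-1}(0) = -inf
    # Step 3: w_j = Phi^{-1}(t_j), coordinate-wise, for p(w) = N(0, sigma^{-2} I).
    W = norm.ppf(T, scale=1.0 / sigma)
    # Step 4: Psi_hat(x) = (1/sqrt(s)) * (exp(-i x^T w_1), ..., exp(-i x^T w_s)).
    return np.exp(-1j * (X @ W.T)) / np.sqrt(s)
```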
Quality of Approximation
◮ Given a pair of points $x, y$, let $u = x - y$. The approximation error is
\[
\epsilon[f_u] = \int_{\mathbb{R}^d} f_u(w) p(w) \, dw - \frac{1}{s} \sum_{i=1}^{s} f_u(w_i),
\qquad \text{where } f_u(w) = e^{i u^T w}.
\]
◮ We want to characterize the behavior of $\epsilon[f_u]$ for $u \in \bar{\mathcal{X}}$, where $\bar{\mathcal{X}} = \{ x - z \mid x, z \in \mathcal{X} \}$.
◮ Consider a broader class of integrands, $\mathcal{F}_{\square_b} = \{ f_u \mid u \in \square_b \}$. Here $\square_b = \{ u \in \mathbb{R}^d \mid |u_j| \le b_j \}$ and $\bar{\mathcal{X}} \subseteq \square_b$.
Main Theoretical Result
Theorem (Average Case Error). Let $\mathcal{U}(\mathcal{F}_{\square_b})$ denote the uniform distribution on $\mathcal{F}_{\square_b}$. That is, $f \sim \mathcal{U}(\mathcal{F}_{\square_b})$ denotes $f = f_u$, where $f_u(x) = e^{-i u^T x}$ and $u$ is drawn from a uniform distribution on $\square_b$. We have
\[
\mathbb{E}_{f \sim \mathcal{U}(\mathcal{F}_{\square_b})} \bigl[ \epsilon_{S,p}[f]^2 \bigr]
= \frac{\pi^d}{\prod_{j=1}^{d} b_j} \, D_p^{\square_b}(S)^2.
\]
Box discrepancy
Suppose that $p(\cdot)$ is a probability density function, and that we can write $p(x) = \prod_{j=1}^{d} p_j(x_j)$, where each $p_j(\cdot)$ is a univariate probability density function as well. Let $\phi_j(\cdot)$ be the characteristic function associated with $p_j(\cdot)$. Then
\[
D_p^{\square_b}(S)^2
= \pi^{-d} \prod_{j=1}^{d} \int_{-b_j}^{b_j} |\phi_j(\beta)|^2 \, d\beta
\;-\; \frac{2(2\pi)^{-d}}{s} \sum_{l=1}^{s} \prod_{j=1}^{d} \int_{-b_j}^{b_j} \phi_j(\beta)\, e^{i w_{lj} \beta} \, d\beta
\;+\; \frac{1}{s^2} \sum_{l=1}^{s} \sum_{j=1}^{s} \operatorname{sinc}_b(w_l, w_j). \qquad (4)
\]
Proof techniques
◮ Consider integrands lying in some Reproducing Kernel Hilbert Space (RKHS). A uniform bound on the approximation error can then be derived by standard arguments.
◮ Here we consider the space of functions that admit an integral representation over $\mathcal{F}_{\square_b}$ of the form
\[
f(x) = \int_{u \in \square_b} \hat{f}(u)\, e^{-i u^T x} \, du, \qquad \text{where } \hat{f}(u) \in L^2(\square_b). \qquad (5)
\]
These spaces are called Paley-Wiener spaces $PW_b$, and they constitute an RKHS.
◮ The damped approximations of the integrands in $\mathcal{F}_{\square_b}$, of the form $\tilde{f}_u(x) = e^{-i u^T x} \operatorname{sinc}(T x)$, are members of $PW_b$ with $\|\tilde{f}_u\|_{PW_b} = \frac{1}{\sqrt{T}}$. Hence, we expect $D_p^{\square_b}$ to provide a discrepancy measure for integrating functions in $\mathcal{F}_{\square_b}$.
Outline: Brief Overview of Kernel Methods · Low-dimensional Explicit Feature Maps · Quasi-Monte Carlo Random Features · Empirical Results
Approximation error on the Gram matrix
[Figure: relative error versus the number of random features (200 to 800) for MC, Halton, Sobol', Lattice, and Digital Net features, in the Euclidean and Frobenius norms, on (a) MNIST and (b) CPU.]
Figure: Relative error in approximating the Gram matrix, measured in the Euclidean norm and the Frobenius norm, i.e. $\|K - \tilde{K}\|_2 / \|K\|_2$ and $\|K - \tilde{K}\|_F / \|K\|_F$, for various $s$. For each kind of random feature and each $s$, 10 independent trials are executed, and the mean and standard deviation are plotted.
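For reference, the plotted quantities can be computed along the following lines, reusing the kernel and feature-map sketches above; gram_relative_error is an illustrative helper, not code from the paper.

```python
import numpy as np

def gram_relative_error(K, Psi, ord="fro"):
    # ||K - K_tilde|| / ||K|| with K_tilde = Psi Psi^*; ord=2 gives the spectral
    # (Euclidean) norm, ord="fro" the Frobenius norm.
    K_tilde = (Psi @ Psi.conj().T).real
    return np.linalg.norm(K - K_tilde, ord) / np.linalg.norm(K, ord)
```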
Generalization error

Dataset   s     Halton        Sobol'           Lattice          Digital net      MC
cpu       100   0.0367 (0)    0.0383 (0.0015)  0.0374 (0.0010)  0.0376 (0.0010)  0.0383 (0.0013)
cpu       500   0.0339 (0)    0.0344 (0.0005)  0.0348 (0.0007)  0.0343 (0.0005)  0.0349 (0.0009)
cpu       1000  0.0334 (0)    0.0339 (0.0007)  0.0337 (0.0004)  0.0335 (0.0003)  0.0338 (0.0005)
census    400   0.0529 (0)    0.0747 (0.0138)  0.0801 (0.0206)  0.0755 (0.0080)  0.0791 (0.0180)
census    1200  0.0553 (0)    0.0588 (0.0080)  0.0694 (0.0188)  0.0587 (0.0067)  0.0670 (0.0078)
census    1800  0.0498 (0)    0.0613 (0.0084)  0.0608 (0.0129)  0.0583 (0.0100)  0.0600 (0.0113)

Table: Regression error, i.e. $\|\hat{y} - y\|_2 / \|y\|_2$, where $\hat{y}$ is the predicted value and $y$ is the ground truth; standard deviations over trials are in parentheses.