Block-Quantized Kernel Matrix for Fast Spectral Embedding
Kai Zhang, James T. Kwok
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology, Hong Kong
Outline
1 Introduction: Eigendecomposition of the Kernel Matrix; Scale-Up Methods
2 The Proposed Method: Gram Matrix of Special Forms; Basic Idea; Matrix Approximation; Matrix Quantization; Density-Weighted Nyström Extension
3 Experiments: Kernel Principal Component Analysis; Image Segmentation
4 Conclusion
Eigendecomposition of the Kernel Matrix

When do we need to eigendecompose the kernel matrix?
Kernel Principal Component Analysis: a powerful tool for extracting nonlinear structure in the high-dimensional feature space (Schölkopf, 1998).
Spectral Clustering: a global, pairwise clustering method based on graph-partitioning theory (Shi & Malik, 2000).
Manifold Learning and Dimensionality Reduction: Laplacian Eigenmaps, ISOMAP, Locally Linear Embedding, ...
Scale-Up Methods

Low-rank approximation of the form L = GG′, where L ∈ R^{N×N}, G ∈ R^{N×m}, and m ≪ N is the rank:
Incomplete Cholesky decomposition (Bach & Jordan, 2002; Fine & Scheinberg, 2001); sparse greedy kernel methods (Smola & Bartlett, 2000).
Sampling-based methods:
Nyström: randomly selects columns of the kernel matrix (Williams & Seeger, 2001; Lawrence & Herbrich, 2005).
Drineas & Mahoney (2005): chooses the columns according to a data-dependent probability.
Ouimet & Bengio (2005): uses a greedy sampling scheme based on the feature-space geometry.
Gram Matrix of Special Forms: Block-Quantized Matrices

Definition. The block-quantized matrix $\bar W$ contains m² constant blocks. The block at the i-th block row and j-th block column, C_ij, has dimension n_i × n_j, with all entries equal to β_ij.

Example (m = 2, n_1 = 2, n_2 = 3, β_11 = a, β_12 = b, β_21 = c, β_22 = d):
$$\bar W = \begin{pmatrix} a & a & b & b & b \\ a & a & b & b & b \\ c & c & d & d & d \\ c & c & d & d & d \\ c & c & d & d & d \end{pmatrix}$$

Note. Block quantization can be performed by (1) partitioning the data set into m clusters, and (2) setting β_ij = K(t_i, t_j) (i, j = 1, 2, ..., m), where t_i is the representative of the i-th cluster.
Properties of Block-Quantized Matrices

Eigensystem of $\bar W$: $\bar W \phi = \lambda \phi$, e.g.,
$$\begin{pmatrix} a & a & b & b & b \\ a & a & b & b & b \\ c & c & d & d & d \\ c & c & d & d & d \\ c & c & d & d & d \end{pmatrix}
\begin{pmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \phi_4 \\ \phi_5 \end{pmatrix}
= \lambda \begin{pmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \phi_4 \\ \phi_5 \end{pmatrix}$$
The first n_1 equations are identical, so are the next n_2 equations, and so on. The system is therefore equivalent to the m × m system $\tilde W \tilde\phi = \lambda \tilde\phi$, where $\tilde W_{ij} = \beta_{ij} n_j$.

How to recover the eigensystem of $\bar W$ from that of $\tilde W$?
Eigenvalues: $\bar W$ and $\tilde W$ have the same (nonzero) eigenvalues.
Eigenvectors: repeat the k-th entry of $\tilde\phi$ n_k times to obtain φ (i.e., φ is piecewise constant).
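To make the recovery concrete, here is a minimal NumPy sketch. The function name block_quantized_eigensystem and the symmetrization detour through diag(√n)·B·diag(√n) (which is similar to $\tilde W$ = B·diag(n) and hence shares its eigenvalues) are our own choices, not from the slides.

```python
import numpy as np

def block_quantized_eigensystem(beta, counts):
    """Eigendecomposition of the N x N block-constant matrix whose (i, j)
    block of size n_i x n_j is filled with beta[i, j], recovered from the
    m x m system W_tilde[i, j] = beta[i, j] * n_j."""
    counts = np.asarray(counts)
    d = np.sqrt(counts)
    # diag(sqrt(n)) B diag(sqrt(n)) is symmetric and similar to W_tilde = B diag(n).
    S = d[:, None] * beta * d[None, :]
    lam, U = np.linalg.eigh(S)            # eigenpairs of the small system
    phi_tilde = U / d[:, None]            # back-transform: eigenvectors of W_tilde
    # Expand: repeat the k-th entry of each small eigenvector n_k times.
    phi = np.repeat(phi_tilde, counts, axis=0)
    phi /= np.linalg.norm(phi, axis=0)    # re-normalize the piecewise-constant vectors
    return lam, phi
```

The returned eigenvalues are the m eigenvalues shared with $\tilde W$; the remaining N − m eigenvalues of the block-quantized matrix are zero, since its rank is at most m.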
Basic Idea

Idea: exploit the blockwise structure of the kernel matrix W to compute its eigendecomposition more efficiently.

Procedure:
1. Find a blockwise-constant matrix $\bar W$ that approximates W, using the Frobenius norm ‖W − $\bar W$‖_F as the approximation criterion.
2. The eigensystem of the N × N matrix $\bar W$ can be fully recovered from that of the m × m matrix $\tilde W$; use it as an approximate solution to the eigendecomposition of W.
Approximation of Eigenvalues

Matrix perturbation theory [Bhatia, 1992]: the difference between two matrices bounds the difference between their singular-value spectra. If A, E ∈ R^{m×n} and σ_k(A) is the k-th singular value of A, then
$$\max_{1 \le t \le n} |\sigma_t(A+E) - \sigma_t(A)| \le \|E\|_2, \qquad \sum_{k=1}^{n} \big(\sigma_k(A+E) - \sigma_k(A)\big)^2 \le \|E\|_F^2.$$
Approximation of Eigenvectors

Our analysis: in some cases the eigenvectors are of greater importance, e.g., in manifold embedding and spectral clustering.

Let W and $\bar W$ be the original and block-quantized matrices, with eigenvalue/eigenvector pairs (α, μ) and (β, ν), respectively, and let E = $\bar W$ − W be the quantization perturbation. Then ‖μ − ν‖ can be bounded in terms of ‖W‖_2, ‖E‖_2, and the eigenvalues α and β, with separate cases for α ≤ β and α > β.

Since ‖E‖_2 ≤ ‖E‖_F, minimizing ‖E‖_F therefore also bounds the approximation error of the eigenvectors.
Minimization of the Matrix Approximation Error

The objective E = ‖W − $\bar W$‖_F² can be written as
$$E = \sum_{i,j=1}^{N} \big(W_{ij} - \bar W_{ij}\big)^2 = \sum_{i,j=1}^{m} \sum_{x_p \in S_i,\, x_q \in S_j} \big(W_{pq} - \beta_{ij}\big)^2.$$
It is minimized by setting ∂E/∂β_ij = 0, which gives
$$\beta_{ij} = \frac{1}{n_i n_j} \sum_{x_p \in S_i,\, x_q \in S_j} K(x_p, x_q).$$
Computing all the β_ij's this way takes O(N²) time.
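As an illustration only (the helper name optimal_block_values is ours), the optimal β_ij is simply the mean kernel value between clusters S_i and S_j. Computing it this way touches every entry of K, which is exactly the O(N²) cost noted above; the method sidesteps this by using β_ij = K(t_i, t_j) with cluster representatives instead.

```python
import numpy as np

def optimal_block_values(K, labels, m):
    """Frobenius-optimal blockwise-constant values: beta[i, j] is the mean
    of the kernel entries between clusters S_i and S_j (O(N^2) overall)."""
    beta = np.zeros((m, m))
    counts = np.bincount(labels, minlength=m)     # cluster sizes n_i
    for i in range(m):
        for j in range(m):
            beta[i, j] = K[np.ix_(labels == i, labels == j)].mean()
    return beta, counts
```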
Data Partitioning

Assumption: the data set is partitioned into clusters in the input space. Each local cluster S_i has a minimum enclosing ball (MEB) with radius r_i, and the cluster representative t_i falls inside this MEB.

Question: how does the partitioning influence the matrix approximation quality?
Approximation Error vs. Data Partitioning

Upper bound: the approximation error E is bounded by
$$E \le \frac{64 N^2 \xi^2 R^2}{\sigma^4} \left( \overline{D^2} + 4R^2 + 4\bar{D}R \right),$$
where
σ is the width of the (stationary) kernel $K(x, y) = k\!\left(\frac{\|x - y\|}{\sigma}\right)$;
ξ = max |k′(x)|;
R = max_{i=1,...,m} r_i is the maximum MEB radius;
$\bar D = \frac{1}{N^2}\sum_{ij} n_i n_j D_{ij}$ is the average pairwise distance;
$\overline{D^2} = \frac{1}{N^2}\sum_{ij} n_i n_j D_{ij}^2$ is the average pairwise squared distance.
Sequential Sampling

Objective: partition the data set into compact local clusters, such that every point is close to its cluster center.

Procedure:
1. Randomly select a sample to initialize the cluster-center set C = {t_1}. For i = 1, 2, ..., N, do the following.
2. Compute l_ij = ‖x_i − t_j‖ for t_j ∈ C. As soon as l_ij ≤ r, assign x_i to S_j, set i = i + 1, and go to the next point.
3. If ‖x_i − t_j‖ > r for all t_j ∈ C, add x_i to C as a new center. Set i = i + 1 and go to the next point.
4. On termination, count the number of samples n_j in each S_j and update each t_j ∈ C as t_j = (1/n_j) Σ_{x_i ∈ S_j} x_i.
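A minimal Python sketch of this one-pass partitioning, assuming Euclidean distances and using our own function name sequential_sampling; this is the flat O(Nm) version, not the hierarchical O(N log m) implementation mentioned on the next slide.

```python
import numpy as np

def sequential_sampling(X, r, rng=None):
    """Single pass: assign each point to the first existing center within
    distance r, otherwise promote it to a new center; finally re-center
    every cluster at the mean of its members."""
    rng = np.random.default_rng(rng)
    n = len(X)
    centers = [X[rng.integers(n)]]               # random sample initializes C = {t_1}
    labels = np.empty(n, dtype=int)
    for i in range(n):
        for j, t in enumerate(centers):
            if np.linalg.norm(X[i] - t) <= r:    # close enough: x_i joins S_j
                labels[i] = j
                break
        else:                                    # farther than r from every center
            centers.append(X[i])
            labels[i] = len(centers) - 1
    centers = np.array([X[labels == j].mean(axis=0)   # t_j = mean of S_j
                        for j in range(len(centers))])
    return centers, labels
```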
Example: Sequential Sampling

[Figure: original data (left); partitioning with a small threshold r (middle); partitioning with a large threshold r (right).]

Property: the local clusters are bounded by a hypercube of side length 2r, where r is the partitioning parameter. The complexity is O(N log m) using a hierarchical implementation.
Gradient Optimization

The approximation error E = ‖W − $\bar W$‖_F² can be written as a function of the cluster representatives t_i:
$$E = \sum_{i,j=1}^{m} \sum_{p \in S_i,\, q \in S_j} \big(K(x_p, x_q) - K(t_i, t_j)\big)^2,$$
which can be minimized by gradient-based optimization. Setting the gradient to zero yields the fixed-point update
$$t_k = \frac{\sum_{j \ne k} t_j \left[ B_{kj}\, K\!\big(\|t_k - t_j\|^2/\sigma^2\big) - A_{kj}\, K^2\!\big(\|t_k - t_j\|^2/\sigma^2\big) \right]}{\sum_{j \ne k} \left[ B_{kj}\, K\!\big(\|t_k - t_j\|^2/\sigma^2\big) - A_{kj}\, K^2\!\big(\|t_k - t_j\|^2/\sigma^2\big) \right]},$$
where A_ij = n_i n_j and B_ij = Σ_{p∈S_i, q∈S_j} K(x_p, x_q). This iteration fine-tunes the cluster representatives, which is especially useful when m is small.
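For concreteness, a hedged sketch of one way to iterate this update for a Gaussian kernel K(u) = exp(−u) with u = ‖t_k − t_j‖²/σ²; the function name refine_representatives, the sweep order, and the fixed iteration count are our assumptions, not the authors' implementation.

```python
import numpy as np

def refine_representatives(T, A, B, sigma, n_iter=20):
    """Fixed-point refinement of the cluster representatives for a Gaussian
    kernel K(u) = exp(-u), u = ||t_k - t_j||^2 / sigma^2.
    A[i, j] = n_i * n_j; B[i, j] = sum of kernel values between clusters i, j.
    Each sweep rewrites t_k as a weighted average of the other representatives."""
    T = T.copy()
    m = len(T)
    for _ in range(n_iter):
        for k in range(m):
            diff = T[k] - T                               # (m, d) differences t_k - t_j
            u = (diff ** 2).sum(axis=1) / sigma ** 2      # squared distances / sigma^2
            Kk = np.exp(-u)                               # K(||t_k - t_j||^2 / sigma^2)
            w = B[k] * Kk - A[k] * Kk ** 2                # weights from the stationarity condition
            w[k] = 0.0                                    # exclude j = k
            denom = w.sum()
            if abs(denom) > 1e-12:                        # skip degenerate updates
                T[k] = (w[:, None] * T).sum(axis=0) / denom
    return T
```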