The Variational Nyström Method for Large-Scale Spectral Problems
Max Vladymyrov (Google Inc.) and Miguel Carreira-Perpiñán (EECS, UC Merced)
June 20, 2016
Graph-based dimensionality reduction methods

Given high-dimensional data points $Y_{D\times N} = (\mathbf{y}_1, \dots, \mathbf{y}_N)$:
1. Convert the data points to an affinity matrix $M_{N\times N}$.
2. Find low-dimensional coordinates $X_{d\times N} = (\mathbf{x}_1, \dots, \mathbf{x}_N)$ so that their similarity is as close as possible to $M$.

[Figure: high-dimensional input $Y \in \mathbb{R}^D$ → affinity matrix $M$ → low-dimensional output $X \in \mathbb{R}^d$.]
Spectral methods

• Consider a spectral problem:
$$\min_X \operatorname{tr}(X M X^T) \quad \text{s.t.}\ X X^T = I,$$
where $M_{N\times N}$ is a symmetric psd affinity matrix.
• Examples:
‣ Laplacian eigenmaps: $M$ is a graph Laplacian.
‣ ISOMAP: $M$ is given by a matrix of shortest distances.
‣ Kernel PCA, MDS, Locally Linear Embedding (LLE), etc.
• The solution is unique and can be found in closed form from the eigenvectors of $M$: $X = U^T$.
• With large $N$, solving the eigenproblem is infeasible even if $M$ is sparse.
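To make the closed-form solution concrete, here is a minimal NumPy sketch (the function name is ours; it follows the minimization convention above and takes the $d$ trailing eigenvectors, as in Laplacian eigenmaps; kernel-PCA-style problems would take the leading ones instead):

```python
import numpy as np

def exact_spectral_embedding(M, d):
    """Minimize tr(X M X^T) s.t. X X^T = I for a symmetric psd M.

    The minimizer is U^T, where U holds the d eigenvectors of M with
    the smallest eigenvalues. Cost is O(N^3), which is what motivates
    the landmark approximations that follow.
    """
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return eigvecs[:, :d].T               # X = U^T, a d x N embedding
```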
Learning with landmarks

The goal is to find a fast, approximate solution for the embedding $X$ using only a subset of $L$ of the original points from $Y$:
1. Select $L$ landmarks (e.g. a random subset).
2. Compute the reduced $L\times L$ affinity matrix.
3. Learn the landmark representation.
4. Project the rest of the points.

[Figure: pipeline from the input in $\mathbb{R}^D$ through the $L\times L$ affinity matrix to the landmark embedding and the full embedding in $\mathbb{R}^d$.]
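As a small illustration of step 1 (random landmark selection; the helper name and the seed are hypothetical choices, not from the paper):

```python
import numpy as np

def select_landmarks(Y, L, seed=0):
    """Pick L random landmark columns from the D x N data matrix Y."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(Y.shape[1], size=L, replace=False)
    return idx, Y[:, idx]  # landmark indices and the D x L landmark matrix
```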
Nyström method

Writing the affinity matrix by blocks (landmarks first):
$$M = \begin{pmatrix} A & B_{21}^T \\ B_{21} & B_{22} \end{pmatrix}, \qquad C = \begin{pmatrix} A \\ B_{21} \end{pmatrix}.$$
Given the eigendecomposition $A = U_A \Lambda_A U_A^T$, the approximation to the eigenvectors of $M$ is:
$$\widetilde{U}_M = \begin{pmatrix} U_A \\ B_{21} U_A \Lambda_A^{-1} \end{pmatrix} = C U_A \Lambda_A^{-1}.$$
Essentially, an out-of-sample formula:
1. Solve the eigenproblem for a subset of the points.
2. Predict the rest of the points through the interpolation formula.
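A minimal sketch of the formula, assuming $C$ is the $N\times L$ landmark column block of $M$ with the landmarks stacked first (so `C[:L]` is $A$). It takes the $d$ leading eigenpairs, as for a psd kernel; a minimization problem would take the trailing ones, at the price of dividing by near-zero eigenvalues:

```python
import numpy as np

def nystrom_embedding(C, d):
    """Nyström approximation U_tilde = C U_A Lambda_A^{-1} (N x d)."""
    L = C.shape[1]
    A = C[:L, :]                      # L x L landmark-landmark block
    lam, U_A = np.linalg.eigh(A)      # eigenvalues in ascending order
    lam, U_A = lam[-d:], U_A[:, -d:]  # keep the d leading eigenpairs
    return (C @ U_A) / lam            # column j is scaled by 1/lam[j]
```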
Column Sampling method

Writing the affinity matrix by blocks (landmarks first):
$$M = \begin{pmatrix} A & B_{21}^T \\ B_{21} & B_{22} \end{pmatrix}, \qquad C = \begin{pmatrix} A \\ B_{21} \end{pmatrix}.$$
The approximation to the eigendecomposition is given by the left singular vectors of $C$:
$$C = U_C \Sigma_C V_C^T \;\Rightarrow\; \widetilde{U}_M = U_C.$$
Uses more information from the affinity matrix $M$ than Nyström, but still ignores the non-landmark/non-landmark block $B_{22}$.
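The corresponding sketch for Column Sampling is just a thin SVD of the same block $C$ (function name is ours):

```python
import numpy as np

def column_sampling_embedding(C, d):
    """Column Sampling: the d leading left singular vectors of C."""
    U_C, _, _ = np.linalg.svd(C, full_matrices=False)  # singular values descend
    return U_C[:, :d]
```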
Locally Linear Landmarks (LLL) (Vladymyrov & Carreira-Perpiñán, 2013)

• Construct the local linear projection matrix $Z_{N\times L}$ from the input $Y$:
$$\mathbf{y}_n \approx \sum_{l=1}^L z_{nl}\, \widetilde{\mathbf{y}}_l,\quad n = 1,\dots,N \;\Rightarrow\; Y \approx \widetilde{Y} Z^T.$$
• Additional assumption: this projection is satisfied in the embedding space: $X = \widetilde{X} Z^T$.
• Plugging the projection into the original objective function:
$$\min_X \operatorname{tr}(X M X^T)\ \text{s.t.}\ X X^T = I,\ X = \widetilde{X} Z^T \;\Rightarrow\; \min_{\widetilde{X}} \operatorname{tr}\bigl(\widetilde{X} Z^T M Z \widetilde{X}^T\bigr)\ \text{s.t.}\ \widetilde{X} Z^T Z \widetilde{X}^T = I.$$
• The solution is given by the reduced generalized eigenproblem: $\widetilde{X} = \operatorname{eig}(Z^T M Z,\ Z^T Z)$.
• The final embedding is predicted as $X = \widetilde{X} Z^T$.
• This solution is optimal given the constraint $X = \widetilde{X} Z^T$.
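A minimal sketch of both LLL steps. The LLE-style weight solver, the number of neighbors `K`, and the regularization constant are our assumptions, not the paper's exact choices; the generalized eigensolve assumes $Z^T Z$ is positive definite:

```python
import numpy as np
from scipy.linalg import eigh, solve

def lll_weights(Y, Y_land, K=5, reg=1e-3):
    """Local linear weights Z (N x L): reconstruct each y_n from its
    K nearest landmarks, with weights that sum to one."""
    N, L = Y.shape[1], Y_land.shape[1]
    Z = np.zeros((N, L))
    for n in range(N):
        d2 = np.sum((Y_land - Y[:, n:n + 1]) ** 2, axis=0)
        idx = np.argsort(d2)[:K]            # K nearest landmarks
        G = Y_land[:, idx] - Y[:, n:n + 1]  # centred neighbours, D x K
        C_loc = G.T @ G                     # local Gram matrix, K x K
        C_loc += reg * np.eye(K)            # regularize for stability
        w = solve(C_loc, np.ones(K))
        Z[n, idx] = w / w.sum()
    return Z

def lll_embedding(M, Z, d):
    """Reduced eigenproblem (Z^T M Z, Z^T Z), then X = X_tilde Z^T."""
    vals, V = eigh(Z.T @ M @ Z, Z.T @ Z)    # ascending eigenvalues
    X_tilde = V[:, :d].T                    # trailing d solutions, d x L
    return X_tilde @ Z.T                    # final d x N embedding
```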
Generalizing approximations

Nyström: expand the upper block of the $N\times L$ approximation:
$$\widetilde{U}_M = \begin{pmatrix} U_A \\ B_{21} U_A \Lambda_A^{-1} \end{pmatrix} = \begin{pmatrix} A U_A \Lambda_A^{-1} \\ B_{21} U_A \Lambda_A^{-1} \end{pmatrix} = C U_A \Lambda_A^{-1}.$$
Column Sampling: rewrite using the eigendecomposition of the $L\times L$ matrix $C^T C$:
$$\widetilde{U}_M = U_C = C V_C \Sigma_C^{-1} = C\, U_{C^T C}\, \Lambda_{C^T C}^{-1/2}.$$
LLL: rewrite the solution as $\widetilde{U}_M = Z \widetilde{X}^T$, where $\widetilde{X}$ is computed optimally (given $Z$) as $\widetilde{X} = \operatorname{eig}(Z^T M Z,\ Z^T Z)$.
Generalizing approximations

Nyström:
1. Solve the smaller $L\times L$ eigendecomposition: $A = U_A \Lambda_A U_A^T$.
2. Apply the $N\times L$ out-of-sample matrix: $\widetilde{U}_M = C U_A \Lambda_A^{-1}$.

Column Sampling:
1. Solve the smaller $L\times L$ eigendecomposition: $C^T C = U_{C^T C} \Lambda_{C^T C} U_{C^T C}^T$.
2. Apply the $N\times L$ out-of-sample matrix: $\widetilde{U}_M = C\, U_{C^T C}\, \Lambda_{C^T C}^{-1/2}$.

LLL:
1. Solve the smaller $L\times L$ eigenproblem: $\widetilde{X} = \operatorname{eig}(Z^T M Z,\ Z^T Z)$.
2. Apply the $N\times L$ out-of-sample matrix: $\widetilde{U}_M = Z \widetilde{X}^T$.
Generalizing approximations

Each approximation consists of the following steps:
• define an out-of-sample matrix $Z_{N\times L}$,
• compute some reduced eigenproblem $A\mathbf{u} = \lambda B\mathbf{u}$ and a matrix $Q_{L\times d}$ that depends on it,
• the final approximation is $\widetilde{U}_M = Z Q$.

Method              | Eigenproblem $(A, B)$ | $Q_{L\times d}$    | $Z_{N\times L}$
Nyström             | $(A,\ I)$             | $U \Lambda^{-1}$   | $C$
Column Sampling     | $(Z^T Z,\ I)$         | $U \Lambda^{-1/2}$ | $C$
LLL                 | $(Z^T M Z,\ Z^T Z)$   | $U$                | computed from $Y \approx \widetilde{Y} Z^T$
Random Projection   | $(Z^T M Z,\ Z^T Z)$   | $U$                | $\operatorname{qr}(M^q S)$
Variational Nyström | $(Z^T M Z,\ Z^T Z)$   | $U$                | $C$
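The table can be turned into a single dispatch routine; a sketch under our naming, shown for the $d$ leading eigenpairs (a minimization such as Laplacian eigenmaps would take the trailing ones, and the Nyström scaling $1/\lambda$ is ill-conditioned when those eigenvalues are near zero):

```python
import numpy as np
from scipy.linalg import eigh

def reduced_embedding(M, Z, method, d):
    """Unified view U_tilde = Z Q of the landmark approximations.

    'nystrom' and 'column_sampling' assume Z = C with the L x L block
    A = Z[:L] on top; the generalized branch covers LLL and Variational
    Nyström (and assumes Z^T Z is positive definite).
    """
    L = Z.shape[1]
    if method == "nystrom":
        lam, U = np.linalg.eigh(Z[:L])       # eigenproblem (A, I)
        Q = U[:, -d:] / lam[-d:]             # Q = U Lambda^{-1}
    elif method == "column_sampling":
        lam, U = np.linalg.eigh(Z.T @ Z)     # eigenproblem (Z^T Z, I)
        Q = U[:, -d:] / np.sqrt(lam[-d:])    # Q = U Lambda^{-1/2}
    else:                                    # LLL / Variational Nyström
        lam, U = eigh(Z.T @ M @ Z, Z.T @ Z)  # generalized eigenproblem
        Q = U[:, -d:]                        # Q = U
    return Z @ Q                             # N x d approximation
```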
Variational Nyström

Add the Nyström out-of-sample constraint to the spectral problem:
$$\min_X \operatorname{tr}(X M X^T)\ \text{s.t.}\ X X^T = I,\ X = \widetilde{X} C^T \;\Rightarrow\; \min_{\widetilde{X}} \operatorname{tr}\bigl(\widetilde{X} C^T M C \widetilde{X}^T\bigr)\ \text{s.t.}\ \widetilde{X} C^T C \widetilde{X}^T = I.$$
From the LLL perspective:
• replace the custom-built out-of-sample matrix $Z$ with the readily available column matrix $C$,
• abandon the local linearity assumption behind the weights $Z$,
• save the computation of $Z$,
• $C$ is usually sparser than $Z$ (due to locality).
Variational Nyström

Add the Nyström out-of-sample constraint to the spectral problem:
$$\min_X \operatorname{tr}(X M X^T)\ \text{s.t.}\ X X^T = I,\ X = \widetilde{X} C^T \;\Rightarrow\; \min_{\widetilde{X}} \operatorname{tr}\bigl(\widetilde{X} C^T M C \widetilde{X}^T\bigr)\ \text{s.t.}\ \widetilde{X} C^T C \widetilde{X}^T = I.$$
From the Nyström perspective:
• use the same out-of-sample matrix $C$, but optimize the choice of the reduced eigenproblem,
• for a fixed $C$, gives a better approximation than Nyström or Column Sampling (the solution is optimal for the out-of-sample kernel $C$),
• uses all the elements of $M$ to construct the reduced eigenproblem,
• forgoes the interpolating property of Nyström.
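A minimal sketch of Variational Nyström under the minimization convention above, so the trailing $d$ generalized eigenvectors are taken (the function name is ours):

```python
import numpy as np
from scipy.linalg import eigh

def variational_nystrom(M, C, d):
    """Solve the reduced generalized eigenproblem (C^T M C, C^T C),
    then map all points back with X = X_tilde C^T.

    Note that C^T M C touches every element of M, including the
    non-landmark block B22 that Nyström and Column Sampling ignore.
    """
    vals, V = eigh(C.T @ M @ C, C.T @ C)  # ascending eigenvalues
    X_tilde = V[:, :d].T                  # trailing d solutions, d x L
    return X_tilde @ C.T                  # final d x N embedding
```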
Subsampling the graph Laplacian

• Consider $M$ given by the normalized graph Laplacian matrix $\mathcal{L} \propto D^{-1/2} W D^{-1/2}$, with
- Gaussian affinity matrix: $w_{nm} = \exp\bigl(-\lVert \mathbf{y}_n - \mathbf{y}_m \rVert^2 / 2\sigma^2\bigr)$,
- degree matrix: $D = \operatorname{diag}\bigl(\sum_{m=1}^N w_{nm}\bigr)$.
• One of the most widely used kernels (Laplacian eigenmaps, spectral clustering).
• The graph Laplacian kernel is data dependent: the $L\times L$ graph Laplacian computed for a subset of $L$ points $\neq$ the $L\times L$ subset of the $N\times N$ graph Laplacian constructed for all $N$ input points.
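A sketch of this kernel construction (the bandwidth `sigma` is a hypothetical choice). It also makes the data dependence visible: every entry is divided by degrees that sum over all $N$ points, so subsampling the points changes the normalization:

```python
import numpy as np

def normalized_affinity(Y, sigma=1.0):
    """Gaussian affinities w_nm = exp(-||y_n - y_m||^2 / (2 sigma^2))
    normalized as D^{-1/2} W D^{-1/2}, for a D x N data matrix Y."""
    sq = np.sum(Y ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Y.T @ Y  # squared distances
    W = np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))
    dinv = 1.0 / np.sqrt(W.sum(axis=1))             # D^{-1/2} as a vector
    return (W * dinv[:, None]) * dinv[None, :]      # D^{-1/2} W D^{-1/2}
```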
Subsampling the graph Laplacian

• Data dependence can be a problem for methods that depend on subsampling:
- Nyström,
- Column Sampling,
- Variational Nyström.
• Not a problem for methods where there is no subsampling:
- LLL,
- Random projection.
• Our solution: normalize the subsampled kernel separately, but in a way that interpolates over the landmarks and gives the exact solution when $L = N$:
$$D_1^{-1/2}\, C\, D_2^{-1/2} \;\to\; M \quad \text{as } L \to N.$$
Subsampling graph Laplacian D 1 D − 1 / 2 D − 1 / 2 D 2 C M L → N • For Nyström and Column Sampling: • we propose different forms for and , D 1 D 2 • we evaluate empirically which one is the best. • For Variational Nyström: • we showed that factors out, D 2 • any leads to the exact solution when . L = N D 1 For the graph Laplacian kernel, the Variational Nyström approximation is more general. 16
Experiments: Laplacian eigenmaps

• Reduce the dimensionality of $N = 20{,}000$ MNIST digits to $d = 10$.
• Run 5 times for different randomly chosen landmarks, from $L = 11$ to $L = 19{,}900$.

[Plot: error with respect to the exact objective function vs. the number of landmarks $L$; methods: Nys, CS, LLL, oNys (Variational Nyström), Halko ($q = 1, 2, 3$).]
Experiments: Laplacian eigenmaps

• Reduce the dimensionality of $N = 20{,}000$ MNIST digits to $d = 10$.
• Run 5 times for different randomly chosen landmarks, from $L = 11$ to $L = 19{,}900$.

[Plot: runtime vs. the number of landmarks $L$ for the same methods.]
Experiments: Laplacian eigenmaps

• Reduce the dimensionality of $N = 20{,}000$ MNIST digits to $d = 10$.
• Run 5 times for different randomly chosen landmarks, from $L = 11$ to $L = 19{,}900$.

[Plot: error with respect to the exact objective function vs. runtime.]

Variational Nyström is winning! 2x as fast as LLL!
Experiments: Spectral clustering

[Figure: original image; exact spectral clustering, $t = 512$ s; Variational Nyström, $t = 25$ s; Nyström, $t = 25$ s.]

20x speedup!