Partial-Hessian Strategies for Fast Learning of Nonlinear Embeddings
Max Vladymyrov and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science, University of California, Merced
https://eecs.ucmerced.edu
June 29, 2012
Introduction
We focus on graph-based dimensionality reduction techniques:
◮ The input is a (sparse) affinity matrix.
◮ The objective function is minimized over the locations of the latent points.
◮ Examples:
  • Spectral methods: Laplacian Eigenmaps (LE), LLE;
    ✓ have a closed-form solution;
    ✗ results are often not satisfactory.
  • Nonlinear methods: SNE, s-SNE, t-SNE, elastic embedding (EE);
    ✓ produce good-quality embeddings;
    ✗ notoriously slow to train, limited to small data sets.
One reason for slow training is inefficient optimization algorithms that take many iterations and move very slowly towards a solution.
COIL-20 Dataset
Rotations of 10 objects every 5°; the input is 128 × 128 greyscale images.
[Figure: 2D embeddings of COIL-20 obtained by the Elastic Embedding and by Laplacian Eigenmaps.]
Teaser
We propose a new training algorithm that:
◮ generalizes over multiple algorithms (s-SNE, t-SNE, EE);
◮ is fast (1-2 orders of magnitude faster than current techniques);
◮ allows deep, inexpensive steps;
◮ scales to larger datasets;
◮ is intuitive and easy to implement.
General Embedding Formulation (Carreira-Perpiñán 2010)
For Y = (y_1, ..., y_N) ∈ R^{D×N}, the matrix of high-dimensional points, and X = (x_1, ..., x_N) ∈ R^{d×N}, the matrix of low-dimensional points, define an objective function
  E(X, λ) = E⁺(X) + λ E⁻(X),   λ ≥ 0.
◮ E⁺ is the attractive term:
  • often quadratic,
  • minimal with coincident points.
◮ E⁻ is the repulsive term:
  • often very nonlinear,
  • minimal with points separated infinitely.
Optimal embeddings balance both forces.
Example: SNE (Hinton & Roweis 2003)
Define P_n and Q_n as distributions for each data point over its neighbors in the high- and low-dimensional spaces, respectively:
  p_nm = exp(−‖y_n − y_m‖²/σ²) / Σ_{k≠n} exp(−‖y_n − y_k‖²/σ²),   q_nm = exp(−‖x_n − x_m‖²) / Σ_{k≠n} exp(−‖x_n − x_k‖²).
The goal is to position the points X such that Q_n matches P_n for every n:
  E_SNE(X) = Σ_{n=1}^N D(P_n ‖ Q_n)
           = Σ_{n,m=1}^N p_nm log(p_nm / q_nm) = −Σ_{n,m=1}^N p_nm log q_nm + C
           = Σ_{n,m=1}^N p_nm ‖x_n − x_m‖² + Σ_{n=1}^N log Σ_{m≠n} exp(−‖x_n − x_m‖²) + C
           = E⁺(X) + λ E⁻(X)   (in this formulation λ = 1).
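To make these definitions concrete, here is a minimal NumPy sketch (not the authors' code) that computes the neighbor distributions and the SNE objective up to the constant C. It assumes a single shared bandwidth σ² instead of per-point bandwidths, and dense D × N and d × N inputs.

```python
import numpy as np

def sq_dists(Z):
    """Pairwise squared Euclidean distances between the columns of Z."""
    s = np.sum(Z**2, axis=0)
    return s[:, None] + s[None, :] - 2.0 * Z.T @ Z

def neighbor_probs(Z, sigma2=1.0):
    """Row n is the distribution over neighbors m != n (P_n or Q_n)."""
    W = np.exp(-sq_dists(Z) / sigma2)
    np.fill_diagonal(W, 0.0)              # exclude k = n from the normalization
    return W / W.sum(axis=1, keepdims=True)

def sne_objective(P, X):
    """E_SNE(X) = sum_n D(P_n || Q_n), up to the additive constant C."""
    Q = neighbor_probs(X)                 # unit bandwidth in the low-dimensional space
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```

P is computed once from the high-dimensional Y (P = neighbor_probs(Y, sigma2)); only Q changes as X is optimized.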
General Embedding Formulation: Other Special Cases
◮ SNE (Hinton & Roweis, '03):  E⁺ = Σ_{n,m=1}^N p_nm ‖x_n − x_m‖²,  E⁻ = Σ_{n=1}^N log Σ_{m=1}^N exp(−‖x_n − x_m‖²)
◮ s-SNE (Cook et al., '07):  E⁺ = Σ_{n,m=1}^N p_nm ‖x_n − x_m‖²,  E⁻ = log Σ_{n,m=1}^N exp(−‖x_n − x_m‖²)
◮ t-SNE (van der Maaten & Hinton, '08):  E⁺ = Σ_{n,m=1}^N p_nm log(1 + ‖x_n − x_m‖²),  E⁻ = log Σ_{n,m=1}^N (1 + ‖x_n − x_m‖²)⁻¹
◮ EE (Carreira-Perpiñán, '10):  E⁺ = Σ_{n,m=1}^N w⁺_nm ‖x_n − x_m‖²,  E⁻ = Σ_{n,m=1}^N w⁻_nm exp(−‖x_n − x_m‖²)
◮ LE & LLE (Belkin & Niyogi, '03; Roweis & Saul, '00):  E⁺ = Σ_{n,m=1}^N w⁺_nm ‖x_n − x_m‖² s.t. constraints,  E⁻ = 0
Here w⁺_nm and w⁻_nm are elements of the affinity matrices.
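As an illustration of the attractive/repulsive split, a minimal NumPy sketch of the elastic-embedding (EE) objective listed above; Wp and Wn stand for the affinity matrices w⁺ and w⁻ (dense here for simplicity, although W⁺ is typically sparse).

```python
import numpy as np

def ee_objective(X, Wp, Wn, lam):
    """Elastic embedding: E+(X) + lambda * E-(X) for a d x N embedding X."""
    s = np.sum(X**2, axis=0)
    D2 = s[:, None] + s[None, :] - 2.0 * X.T @ X     # pairwise squared distances
    return np.sum(Wp * D2) + lam * np.sum(Wn * np.exp(-D2))
```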
Optimization Strategy
At iteration k, look for a search direction p_k as the solution of a linear system B_k p_k = −g_k, where g_k is the current gradient and B_k is a partial-Hessian matrix.
From B_k = I (gradient descent) to B_k = ∇²E (Newton's method): the more Hessian information B_k contains, the faster the convergence rate.
We want B_k to:
◮ contain as much information about the Hessian as possible;
◮ be positive definite (pd);
◮ make the linear system fast to solve and scale up to larger N.
After p_k is obtained, a line search finds the step size α for the next iterate X_{k+1} = X_k + α p_k. We used backtracking line search. (A minimal sketch of this generic iteration is shown below.)
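A minimal sketch of this generic iteration, with hypothetical helper names (not the authors' implementation): B_solve is a placeholder that applies B_k⁻¹ to a matrix of the same shape as X, so passing the identity recovers gradient descent and passing the spectral-direction solver sketched later recovers SD.

```python
import numpy as np

def backtracking(f, X, E0, G, P, alpha0=1.0, rho=0.8, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    alpha = alpha0
    slope = c * np.sum(G * P)             # scaled directional derivative along P (negative for a descent direction)
    while f(X + alpha * P) > E0 + alpha * slope:
        alpha *= rho
    return alpha

def partial_hessian_step(f, grad, B_solve, X):
    """One iteration: solve B_k p_k = -g_k, then line-search along p_k."""
    G = grad(X)
    P = B_solve(-G)                       # search direction; B_solve applies B_k^{-1}
    alpha = backtracking(f, X, f(X), G, P)
    return X + alpha * P
```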
Structure of the Hessian of the Generalized Embedding
Given a symmetric matrix of weights W, we can always define its degree matrix D = diag(Σ_{m=1}^N w_nm) and its graph Laplacian L = D − W. L is positive semidefinite (psd) when the entries of W are nonnegative.
The Nd × Nd Hessian can be written in terms of certain graph Laplacians:
  ∇²E = 4 L ⊗ I_d + 8 L_xx − 16λ vec(X L_q) vec(X L_q)ᵀ,   with L = L⁺ − λ L⁻.
◮ 4 L⁺ ⊗ I_d = ∇²E⁺(X) is psd and data-independent for the Gaussian kernel.
◮ 8 L_xx is data-dependent and overall not definite, but has psd diagonal blocks.†
◮ −16λ vec(X L_q) vec(X L_q)ᵀ is always negative semidefinite.
† Exact expressions for L_xx and L_q are in the paper.
Thus, there are several choices for psd parts of the Hessian:
◮ The best choice depends on the problem.
◮ We focus in particular on the one that does generally well.
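A small dense-NumPy sketch of the graph-Laplacian construction used throughout (in practice W and L are sparse):

```python
import numpy as np

def graph_laplacian(W):
    """L = D - W, with D = diag(sum_m w_nm); psd whenever W is symmetric with nonnegative entries."""
    return np.diag(W.sum(axis=1)) - W
```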
The Spectral Direction (definition)
  ∇²E = 4 L ⊗ I_d + 8 L_xx − 16λ vec(X L_q) vec(X L_q)ᵀ,   with L = L⁺ − λ L⁻.
B_k = 4 L⁺ ⊗ I_d is a convenient Hessian approximation:
◮ it equals the Hessian of the spectral methods, ∇²E⁺(X);
◮ it is always psd ⇒ global convergence under mild assumptions;
◮ it is block-diagonal, with d blocks equal to the N × N graph Laplacian 4 L⁺;
◮ it is constant for the Gaussian kernel; for other kernels we can fix it at some X;
◮ it "bends" the gradient of the nonlinear E using the curvature of the spectral E⁺.
The Spectral Direction (computation)
We need to solve the linear system B_k p_k = −g_k efficiently at every iteration (naively O(N³d)).
◮ Cache the (also sparse) Cholesky factor of L⁺ at the first iteration. After that, each iteration requires only two triangular solves.
◮ For scalability, we can make W⁺ even sparser than it already is with a κ-NN graph (κ ∈ [1, N] is a user parameter; see the sketch below). This affects only the runtime; convergence is still guaranteed.
◮ B_k is only psd, not strictly pd ⇒ add a small constant µ to the diagonal elements before factorizing.
Cost per iteration:
  Objective function   O(N²d)
  Gradient             O(N²d)
  Spectral direction   O(Nκd)
This strategy adds almost no overhead compared to the objective function and gradient computation.
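One possible sketch of the κ-NN sparsification of W⁺ mentioned above, keeping the κ largest affinities per point and symmetrizing so that the Laplacian stays psd; the exact construction is illustrative, since the method only needs a sparser symmetric nonnegative W⁺.

```python
import numpy as np

def sparsify_knn(Wp, kappa):
    """Keep only the kappa largest affinities in each row of W+, then symmetrize."""
    W = np.zeros_like(Wp)
    for n, row in enumerate(Wp):
        idx = np.argsort(row)[-kappa:]    # indices of the kappa largest affinities in row n
        W[n, idx] = row[idx]
    return np.maximum(W, W.T)             # symmetric and nonnegative => psd graph Laplacian
```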
The Spectral Direction (pseudocode)
SpectralDirection(X_0, W⁺, κ)
  (optional) further sparsify W⁺ with a κ-NN graph
  L⁺ ← D⁺ − W⁺                      compute graph Laplacian             O(N)
  R ← chol(L⁺ + µI)                 compute Cholesky decomposition      O(N²κ)
  k ← 1
  repeat
    compute E_k and g_k             objective function and gradient     O(N²d)
    p_k ← −R⁻ᵀ(R⁻¹ g_k)             solve two triangular systems        O(Nκd)
    α ← backtracking line search
    X_k ← X_{k−1} + α p_k
    k ← k + 1
  until stop
  return X
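A runnable dense sketch of the core of this pseudocode, assuming NumPy/SciPy; the real implementation caches a sparse Cholesky factor, but the same two triangular solves per iteration appear here inside cho_solve.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_spectral_direction_solver(Wp, mu=1e-10):
    """Factorize B = 4 L+ + mu*I once; return a function that applies B^{-1} to d x N matrices."""
    Lp = np.diag(Wp.sum(axis=1)) - Wp     # graph Laplacian L+ = D+ - W+
    factor = cho_factor(4.0 * Lp + mu * np.eye(Lp.shape[0]))

    def apply_inverse(V):
        # (4 L+ (x) I_d) vec(P) = vec(V)  <=>  (4 L+) P^T = V^T  (P, V are d x N)
        return cho_solve(factor, V.T).T   # two triangular solves against the cached factor
    return apply_inverse
```

Combined with the generic iteration sketched earlier: B_solve = make_spectral_direction_solver(Wp); X = partial_hessian_step(f, grad, B_solve, X).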
Experimental Evaluation: Methods Compared
• Gradient descent (GD): B_k = I (Hinton & Roweis, '03)
• Diagonal methods:
  ◮ fixed-point iterations (FP): B_k = 4 D⁺ ⊗ I_d (Carreira-Perpiñán, '10)
  ◮ the diagonal of the Hessian (DiagH): B_k = 4 D⁺ ⊗ I_d + 8λ D_xx
• Our methods:
  ◮ spectral direction (SD): B_k = 4 L⁺ ⊗ I_d
  ◮ partial Hessian SD–: B_k = 4 L⁺ ⊗ I_d + 8λ L_xx^{(i∗,i∗)}, with the linear system solved by conjugate gradients
• Standard large-scale methods:
  ◮ nonlinear conjugate gradients (CG)
  ◮ L-BFGS
COIL-20. Convergence to the same minimum, EE
Initialize X_0 close enough to X_∞ so that all methods have the same initial and final points.
[Figure: objective function vs. number of iterations and vs. runtime in seconds (log scales) for GD, FP, DiagH, SD, SD–, L-BFGS and CG.]
COIL-20. Convergence from random initial X, s-SNE
Run the algorithms 50 times for 20 seconds each with different initializations.
[Figure: objective function value vs. number of iterations for GD, FP, DiagH, SD, SD–, L-BFGS and CG.]
[Animation]
MNIST. t-SNE
N = 20 000 images of handwritten digits (each a 28 × 28 pixel grayscale image, D = 784). 1 hour of optimization.
[Figure: objective function value vs. number of iterations and vs. runtime in minutes for FP, SD, SD–, L-BFGS and CG.]
MNIST. Embedding after 1 hour of t-SNE optimization
[Figure: embeddings produced by the fixed-point iteration and by the spectral direction.]
[Animation]
Conclusions
◮ We presented a common framework for many well-known dimensionality reduction techniques.
◮ We showed the role of graph Laplacians in the Hessian and derived several partial-Hessian optimization strategies.
◮ We presented the spectral direction: a new, simple, generic and scalable optimization strategy that runs one to two orders of magnitude faster than traditional methods.
◮ The evaluation of E and ∇E remains the bottleneck (O(N²d)); it can be addressed in future work (e.g. with Fast Multipole Methods).
◮ Matlab code: https://eng.ucmerced.edu/people/vladymyrov/
Partially supported by NSF CAREER award IIS-0754089.