Partial-Hessian Strategies for Fast Learning of Nonlinear Embeddings
Max Vladymyrov and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science, University of California, Merced
https://eecs.ucmerced.edu
August 30, 2012
Dimensionality reduction
Given a high-dimensional dataset Y = (y_1, ..., y_N) ⊂ R^D, find a low-dimensional representation X = (x_1, ..., x_N) ⊂ R^d with d ≪ D.
[Figure: a three-dimensional dataset Y (axes Y_1, Y_2, Y_3) and its two-dimensional embedding X (axes X_1, X_2).]
Can be used for:
◮ Data compression.
◮ Visualization.
◮ Detection of latent manifold structure.
◮ Fast search.
◮ ...
Graph-based dimensionality reduction techniques
◮ Input: a (sparse) affinity matrix W defined on a set of high-dimensional points Y.
◮ Objective function: minimized over the latent points X.
◮ Examples:
  • Spectral methods: Laplacian Eigenmaps (LE), LLE; ✓ closed-form solution; ✗ results can be bad.
  • Nonlinear methods: SNE, t-SNE, elastic embedding (EE); ✓ better results; ✗ slow to train, limited to small data sets.
COIL-20 Dataset
Rotations of 10 objects every 5°; the input is 128 × 128 greyscale images.
[Figure: sample input images Y and the resulting 2D embeddings: Elastic Embedding (left) vs. Laplacian Eigenmaps (right).]
General Embedding Formulation (Carreira-Perpiñán 2010)
For Y ∈ R^{D×N}, the matrix of high-dimensional points, and X ∈ R^{d×N}, the matrix of low-dimensional points:
E(X, λ) = E+(X) + λ E−(X),   λ ≥ 0
◮ E+(X) is the attractive term: often quadratic, minimal with coincident points.
◮ E−(X) is the repulsive term: often very nonlinear, minimal with points separated infinitely.
Optimal embeddings balance both forces.
General Embedding Formulation: Special Cases
◮ SNE (Hinton & Roweis, '03):
  E+(X) = Σ_{n,m=1}^N p_nm ‖x_n − x_m‖²,   E−(X) = Σ_{n=1}^N log Σ_{m=1}^N e^{−‖x_n − x_m‖²}
◮ t-SNE (van der Maaten & Hinton, '08):
  E+(X) = Σ_{n,m=1}^N p_nm log(1 + ‖x_n − x_m‖²),   E−(X) = log Σ_{n,m=1}^N (1 + ‖x_n − x_m‖²)^{−1}
◮ EE (Carreira-Perpiñán, '10):
  E+(X) = Σ_{n,m=1}^N w+_nm ‖x_n − x_m‖²,   E−(X) = Σ_{n,m=1}^N w−_nm e^{−‖x_n − x_m‖²}
◮ LE & LLE (Belkin & Niyogi, '03; Roweis & Saul, '00):
  E+(X) = Σ_{n,m=1}^N w+_nm ‖x_n − x_m‖² subject to constraints,   E−(X) = 0
w+_nm and w−_nm are elements of affinity matrices.
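For concreteness, here is a minimal NumPy sketch (ours, not the authors' Matlab implementation) of how the EE objective above can be evaluated; the function name, the dense-matrix layout and the (d, N) column convention for X are illustrative assumptions, and the affinity matrices Wp (w+) and Wn (w−) are assumed precomputed.

```python
import numpy as np

def ee_objective(X, Wp, Wn, lam):
    """Elastic embedding objective E(X, lambda) = E+(X) + lambda * E-(X).

    X  : (d, N) array of low-dimensional points (one point per column).
    Wp : (N, N) attractive affinities w+_nm.
    Wn : (N, N) repulsive affinities w-_nm.
    """
    # Pairwise squared Euclidean distances between the columns of X.
    sq = np.sum(X**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.maximum(D2, 0.0, out=D2)            # guard against tiny negative values

    E_plus = np.sum(Wp * D2)               # attractive term
    E_minus = np.sum(Wn * np.exp(-D2))     # repulsive term
    return E_plus + lam * E_minus
```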
Optimization Strategy
At every iteration k:
1. Choose a positive definite B_k.
2. Solve the linear system B_k p_k = −g_k for a search direction p_k, where g_k is the gradient.
3. Find a step size α with a line search (e.g. backtracking) and set X_{k+1} = X_k + α p_k.
Convergence is guaranteed (under mild assumptions).
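A sketch of one iteration of this strategy, assuming the caller supplies the objective fun, its gradient grad, and a routine make_B that returns a positive definite matrix; all names here are ours, for illustration only.

```python
import numpy as np

def pd_descent_step(X, fun, grad, make_B, alpha0=1.0, rho=0.8, c=1e-4):
    """One iteration: solve B p = -g, then a backtracking (Armijo) line search."""
    g = grad(X)                                   # gradient g_k at the current iterate
    B = make_B(X)                                 # positive definite B_k
    p = np.linalg.solve(B, -g.ravel()).reshape(X.shape)   # search direction p_k

    f0 = fun(X)
    slope = np.sum(g * p)                         # < 0 because B is positive definite
    alpha = alpha0
    while fun(X + alpha * p) > f0 + c * alpha * slope and alpha > 1e-12:
        alpha *= rho                              # shrink the step until Armijo holds
    return X + alpha * p
```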
How to choose a good B_k?
Solve the linear system B_k p_k = −g_k:
◮ B_k = I gives gradient descent; B_k = ∇²E gives Newton's method.
◮ More Hessian information → faster convergence rate.
[Figure: iterates on a 2D objective for B_k = I, our B_k, and B_k = ∇²E.]
We want B_k to:
◮ contain as much Hessian information as possible;
◮ be positive definite (pd);
◮ be fast to solve the linear system with, and scale up to larger N.
The Spectral Direction
The Hessian of the general embedding formulation is
∇²E = 4(L+ − λ L−) ⊗ I_d + 8 L_xx − 16 λ vec(X L_q) vec(X L_q)^T
where L+, L−, L_xx, L_q are graph Laplacians.
B = 4 L+ ⊗ I_d is a convenient Hessian approximation:
◮ block-diagonal, with d blocks equal to the N × N graph Laplacian 4 L+;
◮ always psd ⇒ global convergence under mild assumptions;
◮ constant for the Gaussian kernel; for other kernels we can fix it at some X;
◮ equal to the Hessian of the spectral methods, ∇²E+(X);
◮ "bends" the gradient of the nonlinear E using the curvature of the spectral E+.
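For reference, a graph Laplacian such as L+ is simply the degree matrix minus the affinity matrix. A short SciPy sketch (our own illustration, assuming a precomputed affinity matrix Wp) of how the N × N block 4 L+ of B could be assembled:

```python
import numpy as np
import scipy.sparse as sp

def attractive_hessian_block(Wp):
    """Build 4 L+ = 4 (D+ - W+), the N x N block that B repeats d times."""
    Wp = sp.csr_matrix(Wp)
    degrees = np.asarray(Wp.sum(axis=1)).ravel()   # d+_n = sum_m w+_nm
    Lplus = sp.diags(degrees) - Wp                 # graph Laplacian L+ = D+ - W+
    return 4.0 * Lplus
```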
The Spectral Direction (computation)
Solve B p_k = −g_k efficiently at every iteration k (naively O(N³ d)):
◮ Cache the Cholesky factor of L+ in the first iteration.
◮ (Further) sparsify the weights of L+ with a κ-NN graph.
Runtime is faster and convergence is still guaranteed.
Cost per iteration:
◮ Objective function: O(N² d)
◮ Gradient: O(N² d)
◮ Spectral direction: O(N κ d)
This strategy adds almost no overhead compared to the objective function and gradient computations.
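A sketch of how the cached factorization might look in practice; this is our own illustration (the authors' code is in Matlab), it uses SciPy's sparse LU factorization as a stand-in for a sparse Cholesky factorization, and it adds a tiny ridge because L+ is only positive semidefinite.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

class SpectralDirection:
    """Factorize B's N x N block once and reuse it at every iteration."""

    def __init__(self, Lplus, ridge=1e-10):
        N = Lplus.shape[0]
        block = (4.0 * Lplus + ridge * sp.identity(N)).tocsc()
        self.solver = spla.splu(block)          # factorized once, in the first iteration

    def direction(self, G):
        """Given the (d, N) gradient G, return the (d, N) search direction P."""
        # B = 4 L+ (x) I_d is block-diagonal, so each of the d rows of G
        # (one per latent dimension) is solved independently against 4 L+.
        return np.vstack([-self.solver.solve(g) for g in G])
```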
Experimental Evaluation: Methods Compared
Now:
◮ Gradient descent (GD), B = I (Hinton & Roweis, '03);
◮ Fixed-point iterations (FP), B = 4 D+ ⊗ I_d (Carreira-Perpiñán, '10);
◮ Spectral direction (SD), B = 4 L+ ⊗ I_d;
◮ L-BFGS.
More experiments and methods at the poster:
◮ Hessian diagonal update;
◮ nonlinear Conjugate Gradient;
◮ some other interesting partial-Hessian updates.
COIL-20. Convergence analysis, s-SNE
COIL-20 dataset of rotated objects (N = 720, D = 16 384, d = 2).
Run the algorithms 50 times for 30 seconds each, initialized randomly.
[Figure: objective function value vs. number of iterations for gradient descent, fixed-point iterations, spectral direction, and L-BFGS.]
MNIST. t-SNE
◮ N = 20 000 images of handwritten digits (each a 28 × 28 pixel grayscale image, D = 784).
◮ One hour of optimization on a modern computer with one CPU.
[Figure: objective function value vs. runtime (minutes) for fixed-point iterations, spectral direction, and L-BFGS.]
Conclusions
◮ We presented a common framework for many well-known dimensionality reduction techniques.
◮ We presented the spectral direction: a new simple, generic and scalable optimization strategy that runs one to two orders of magnitude faster than traditional methods.
◮ Matlab code: http://eecs.ucmerced.edu/ .
Ongoing work:
◮ The evaluation of E and ∇E remains the bottleneck (O(N² d)). We can use Fast Multipole Methods to speed up the runtime.
◮ Avoid line search; use constant, near-optimal step sizes.
MNIST. Embedding after 20 min of EE optimization
[Figure: 2D embeddings of MNIST obtained with the fixed-point iteration (left) and the spectral direction (right).]
COIL-20. Convergence to the same minimum, s-SNE
We initialized X_0 close enough to X_∞ so that all methods have the same initial and final points.
[Figure: s-SNE objective function vs. number of iterations and vs. runtime (seconds), on log scales, for GD, FP, DiagH, SD, SD–, L-BFGS and CG.]
COIL-20: Homotopy optimization for EE
Start with a small λ, where E is convex, and follow the path of minima to the desired λ by minimizing over X as λ increases. We used 50 log-spaced values of λ from 10⁻⁴ to 10².
[Figure: number of iterations and time (s) vs. λ, on log scales, for DiagH, GD, SD–, FP, L-BFGS, SD and CG.]
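In outline, the homotopy strategy is a warm-started loop over the λ schedule. The sketch below is ours and assumes a routine minimize_embedding(X, lam), for example the spectral-direction optimizer, that minimizes E(X, λ) from the given starting point.

```python
import numpy as np

def homotopy_ee(X0, minimize_embedding, lam_min=1e-4, lam_max=1e2, n_steps=50):
    """Follow the path of minima from a small lambda (nearly convex E) to the target lambda."""
    X = X0
    for lam in np.logspace(np.log10(lam_min), np.log10(lam_max), n_steps):
        X = minimize_embedding(X, lam)   # warm start from the previous minimizer
    return X
```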