Partial-Hessian Strategies for Fast Learning of Nonlinear Embeddings
Max Vladymyrov and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science, University of California, Merced
https://eecs.ucmerced.edu
August 30, 2012
Dimensionality reduction
Given a high-dimensional dataset Y = (y_1, ..., y_N) ⊂ R^D, find a low-dimensional representation X = (x_1, ..., x_N) ⊂ R^d with d ≪ D.
[Figure: a three-dimensional dataset Y (axes Y_1, Y_2, Y_3) and its two-dimensional embedding X (axes X_1, X_2).]
Can be used for:
◮ Data compression.
◮ Visualization.
◮ Detection of latent manifold structure.
◮ Fast search.
◮ ...
Graph-based dimensionality reduction techniques
◮ Input: a (sparse) affinity matrix W defined on a set of high-dimensional points Y.
◮ Objective function: minimized over the latent points X.
◮ Examples:
  • Spectral methods: Laplacian Eigenmaps (LE), LLE; ✓ closed-form solution; ✗ results can be bad.
  • Nonlinear methods: SNE, t-SNE, elastic embedding (EE); ✓ better results; ✗ slow to train, limited to small data sets.
COIL-20 Dataset
Rotations of 10 objects every 5°; the input is 128 × 128 greyscale images.
[Figure: sample input images Y and the resulting 2D embeddings: Elastic Embedding (left) vs. Laplacian Eigenmaps (right).]
General Embedding Formulation (Carreira-Perpiñán 2010)
For Y ∈ R^{D×N}, the matrix of high-dimensional points, and X ∈ R^{d×N}, the matrix of low-dimensional points:
E(X, λ) = E+(X) + λ E−(X),   λ ≥ 0
◮ E+(X) is the attractive term: often quadratic, minimal with coincident points.
◮ E−(X) is the repulsive term: often very nonlinear, minimal with points separated infinitely.
Optimal embeddings balance both forces.
General Embedding Formulation: Special Cases
◮ SNE (Hinton & Roweis, '03):
  E+(X) = Σ_{n,m=1}^N p_nm ‖x_n − x_m‖²,   E−(X) = Σ_{n=1}^N log Σ_{m=1}^N e^{−‖x_n − x_m‖²}
◮ t-SNE (van der Maaten & Hinton, '08):
  E+(X) = Σ_{n,m=1}^N p_nm log(1 + ‖x_n − x_m‖²),   E−(X) = log Σ_{n,m=1}^N (1 + ‖x_n − x_m‖²)^{−1}
◮ EE (Carreira-Perpiñán, '10):
  E+(X) = Σ_{n,m=1}^N w+_nm ‖x_n − x_m‖²,   E−(X) = Σ_{n,m=1}^N w−_nm e^{−‖x_n − x_m‖²}
◮ LE & LLE (Belkin & Niyogi, '03; Roweis & Saul, '00):
  E+(X) = Σ_{n,m=1}^N w+_nm ‖x_n − x_m‖² subject to constraints,   E−(X) = 0
w+_nm and w−_nm are elements of affinity matrices.
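For concreteness, here is a minimal NumPy sketch (ours, not the authors' Matlab implementation) of how the EE objective above can be evaluated; the function name, the dense-matrix layout and the (d, N) column convention for X are illustrative assumptions, and the affinity matrices Wp (w+) and Wn (w−) are assumed precomputed.

```python
import numpy as np

def ee_objective(X, Wp, Wn, lam):
    """Elastic embedding objective E(X, lambda) = E+(X) + lambda * E-(X).

    X  : (d, N) array of low-dimensional points (one point per column).
    Wp : (N, N) attractive affinities w+_nm.
    Wn : (N, N) repulsive affinities w-_nm.
    """
    # Pairwise squared Euclidean distances between the columns of X.
    sq = np.sum(X**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.maximum(D2, 0.0, out=D2)            # guard against tiny negative values

    E_plus = np.sum(Wp * D2)               # attractive term
    E_minus = np.sum(Wn * np.exp(-D2))     # repulsive term
    return E_plus + lam * E_minus
```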
Optimization Strategy
At every iteration k:
1. Choose a positive definite B_k.
2. Solve the linear system B_k p_k = −g_k for a search direction p_k, where g_k is the gradient.
3. Find a step size α with a line search (e.g. backtracking) and set X_{k+1} = X_k + α p_k.
Convergence is guaranteed (under mild assumptions).
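A sketch of one iteration of this strategy, assuming the caller supplies the objective fun, its gradient grad, and a routine make_B that returns a positive definite matrix; all names here are ours, for illustration only.

```python
import numpy as np

def pd_descent_step(X, fun, grad, make_B, alpha0=1.0, rho=0.8, c=1e-4):
    """One iteration: solve B p = -g, then a backtracking (Armijo) line search."""
    g = grad(X)                                   # gradient g_k at the current iterate
    B = make_B(X)                                 # positive definite B_k
    p = np.linalg.solve(B, -g.ravel()).reshape(X.shape)   # search direction p_k

    f0 = fun(X)
    slope = np.sum(g * p)                         # < 0 because B is positive definite
    alpha = alpha0
    while fun(X + alpha * p) > f0 + c * alpha * slope and alpha > 1e-12:
        alpha *= rho                              # shrink the step until Armijo holds
    return X + alpha * p
```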
How to choose a good B_k?
Solve the linear system B_k p_k = −g_k:
◮ B_k = I gives gradient descent; B_k = ∇²E gives Newton's method.
◮ More Hessian information → faster convergence rate.
[Figure: iterates on a 2D objective for B_k = I, our B_k, and B_k = ∇²E.]
We want B_k to:
◮ contain as much Hessian information as possible;
◮ be positive definite (pd);
◮ be fast to solve the linear system with, and scale up to larger N.
The Spectral Direction
The Hessian of the general embedding formulation is
∇²E = 4(L+ − λ L−) ⊗ I_d + 8 L_xx − 16 λ vec(X L_q) vec(X L_q)^T
where L+, L−, L_xx, L_q are graph Laplacians.
B = 4 L+ ⊗ I_d is a convenient Hessian approximation:
◮ block-diagonal, with d blocks equal to the N × N graph Laplacian 4 L+;
◮ always psd ⇒ global convergence under mild assumptions;
◮ constant for the Gaussian kernel; for other kernels we can fix it at some X;
◮ equal to the Hessian of the spectral methods, ∇²E+(X);
◮ "bends" the gradient of the nonlinear E using the curvature of the spectral E+.
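For reference, a graph Laplacian such as L+ is simply the degree matrix minus the affinity matrix. A short SciPy sketch (our own illustration, assuming a precomputed affinity matrix Wp) of how the N × N block 4 L+ of B could be assembled:

```python
import numpy as np
import scipy.sparse as sp

def attractive_hessian_block(Wp):
    """Build 4 L+ = 4 (D+ - W+), the N x N block that B repeats d times."""
    Wp = sp.csr_matrix(Wp)
    degrees = np.asarray(Wp.sum(axis=1)).ravel()   # d+_n = sum_m w+_nm
    Lplus = sp.diags(degrees) - Wp                 # graph Laplacian L+ = D+ - W+
    return 4.0 * Lplus
```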
The Spectral Direction (computation)
Solve B p_k = −g_k efficiently at every iteration k (naively O(N³ d)):
◮ Cache the Cholesky factor of L+ in the first iteration.
◮ (Further) sparsify the weights of L+ with a κ-NN graph.
Runtime is faster and convergence is still guaranteed.
Cost per iteration:
◮ Objective function: O(N² d)
◮ Gradient: O(N² d)
◮ Spectral direction: O(N κ d)
This strategy adds almost no overhead compared to the objective function and gradient computations.
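A sketch of how the cached factorization might look in practice; this is our own illustration (the authors' code is in Matlab), it uses SciPy's sparse LU factorization as a stand-in for a sparse Cholesky factorization, and it adds a tiny ridge because L+ is only positive semidefinite.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

class SpectralDirection:
    """Factorize B's N x N block once and reuse it at every iteration."""

    def __init__(self, Lplus, ridge=1e-10):
        N = Lplus.shape[0]
        block = (4.0 * Lplus + ridge * sp.identity(N)).tocsc()
        self.solver = spla.splu(block)          # factorized once, in the first iteration

    def direction(self, G):
        """Given the (d, N) gradient G, return the (d, N) search direction P."""
        # B = 4 L+ (x) I_d is block-diagonal, so each of the d rows of G
        # (one per latent dimension) is solved independently against 4 L+.
        return np.vstack([-self.solver.solve(g) for g in G])
```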
Experimental Evaluation: Methods Compared
Now:
◮ Gradient descent (GD), B = I (Hinton & Roweis, '03);
◮ Fixed-point iterations (FP), B = 4 D+ ⊗ I_d (Carreira-Perpiñán, '10);
◮ Spectral direction (SD), B = 4 L+ ⊗ I_d;
◮ L-BFGS.
More experiments and methods at the poster:
◮ Hessian diagonal update;
◮ nonlinear Conjugate Gradient;
◮ some other interesting partial-Hessian updates.
COIL-20. Convergence analysis, s-SNE
COIL-20 dataset of rotated objects (N = 720, D = 16 384, d = 2).
Run the algorithms 50 times for 30 seconds each, initialized randomly.
[Figure: objective function value vs. number of iterations for gradient descent, fixed-point iterations, spectral direction, and L-BFGS.]
MNIST. t-SNE
◮ N = 20 000 images of handwritten digits (each a 28 × 28 pixel grayscale image, D = 784).
◮ One hour of optimization on a modern computer with one CPU.
[Figure: objective function value vs. runtime (minutes) for fixed-point iterations, spectral direction, and L-BFGS.]
Conclusions
◮ We presented a common framework for many well-known dimensionality reduction techniques.
◮ We presented the spectral direction: a new simple, generic and scalable optimization strategy that runs one to two orders of magnitude faster than traditional methods.
◮ Matlab code: http://eecs.ucmerced.edu/ .
Ongoing work:
◮ The evaluation of E and ∇E remains the bottleneck (O(N² d)). We can use Fast Multipole Methods to speed up the runtime.
◮ Avoid line search; use constant, near-optimal step sizes.
MNIST. Embedding after 20 min of EE optimization
[Figure: 2D embeddings of MNIST obtained with the fixed-point iteration (left) and the spectral direction (right).]
COIL-20. Convergence to the same minimum, s-SNE
We initialized X_0 close enough to X_∞ so that all methods have the same initial and final points.
[Figure: s-SNE objective function vs. number of iterations and vs. runtime (seconds), on log scales, for GD, FP, DiagH, SD, SD–, L-BFGS and CG.]
COIL-20: Homotopy optimization for EE
Start with a small λ, where E is convex, and follow the path of minima to the desired λ by minimizing over X as λ increases. We used 50 log-spaced values of λ from 10⁻⁴ to 10².
[Figure: number of iterations and time (s) vs. λ, on log scales, for DiagH, GD, SD–, FP, L-BFGS, SD and CG.]
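In outline, the homotopy strategy is a warm-started loop over the λ schedule. The sketch below is ours and assumes a routine minimize_embedding(X, lam), for example the spectral-direction optimizer, that minimizes E(X, λ) from the given starting point.

```python
import numpy as np

def homotopy_ee(X0, minimize_embedding, lam_min=1e-4, lam_max=1e2, n_steps=50):
    """Follow the path of minima from a small lambda (nearly convex E) to the target lambda."""
    X = X0
    for lam in np.logspace(np.log10(lam_min), np.log10(lam_max), n_steps):
        X = minimize_embedding(X, lam)   # warm start from the previous minimizer
    return X
```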