Entropic Affinities: Properties and Efficient Numerical Computation


  1. Entropic Affinities: Properties and Efficient Numerical Computation Max Vladymyrov and Miguel Carreira-Perpiñán Electrical Engineering and Computer Science University of California, Merced http://eecs.ucmerced.edu June 18, 2013

  2. Summary
  • The entropic affinities define affinities so that each point has an effective number of neighbors equal to K.
  • First introduced in: G. E. Hinton & S. Roweis, "Stochastic Neighbor Embedding", NIPS 2002.
  • Not in widespread use, even though they work well in a range of problems.
  • We study some properties of entropic affinities and give fast algorithms to compute them.

  3. Affinity matrix
  Defines a measure of similarity between points in the dataset. Used in:
  • Dimensionality reduction: Stochastic Neighbor Embedding, t-SNE, Elastic Embedding, Laplacian Eigenmaps.
  • Clustering: Mean-Shift, Spectral clustering.
  • Semi-supervised learning.
  • and others.
  The performance of these algorithms depends crucially on the affinity construction, which is governed by the bandwidth σ. Common practice is to set σ:
  • constant, or
  • by a rule of thumb (e.g. the distance to the 7th nearest neighbor; Zelnik & Perona, '05).
  [Figure: a dataset and its affinity matrix.]

  4. Motivation: choice of σ
  COIL-20: rotations of objects every 5°; the input are 128 × 128 greyscale images.
  [Figure: affinity matrices for three choices of σ — rule of thumb (distance to the 7th nn, Zelnik & Perona, '05), constant σ, and entropic affinities.]

  5. Motivation: choice of σ
  COIL-20: rotations of objects every 5°; the input are 128 × 128 greyscale images.
  [Figure: dimensionality reduction with the Elastic Embedding algorithm for the same three choices of σ.]

  6. Search for a good σ
  A good σ should be:
  • set separately for every data point,
  • set taking into account the whole distribution of distances.
  [Figure: two datasets with the same n points but different distance distributions around x_1 and x_2.]

  7. Entropic affinities
  In the entropic affinities, the bandwidth σ is set individually for each point so that the point has a distribution over its neighbors with a fixed perplexity K (Hinton & Roweis, 2002).
  • Consider a distribution of the neighbors x_1, …, x_N ∈ R^D for a point x ∈ R^D:
      p_n(x; σ) = K(‖(x − x_n)/σ‖²) / Σ_{k=1}^N K(‖(x − x_k)/σ‖²),
    i.e. the posterior distribution of a kernel density estimate.
  • The entropy of the distribution is defined as H(x, σ) = −Σ_{n=1}^N p_n(x, σ) log p_n(x, σ).
  • Consider the bandwidth (or precision) β = 1/(2σ²) given the perplexity K: H(x, β) = log K.
  • A perplexity of K in a distribution p over N neighbors provides the same surprise as if we were to choose among K equiprobable neighbors.
  • We define the entropic affinities as the probabilities p = (p_1, …, p_N) for x with respect to β. Those affinities define a random-walk matrix.
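As a concrete sketch of these definitions (assuming the Gaussian kernel, so that p_n ∝ exp(−β‖x − x_n‖²); the function names are illustrative, not the authors' code):

```python
import numpy as np

def neighbor_distribution(d2, beta):
    """Gaussian affinities p_n(x; beta) of one point over its N neighbors.

    d2   : array of squared distances from x to each neighbor x_n
    beta : precision (beta = 1/(2 sigma^2))
    """
    logits = -beta * d2
    p = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return p / p.sum()

def perplexity(d2, beta):
    """exp(H(x, beta)): the effective number of neighbors at this precision."""
    p = neighbor_distribution(d2, beta)
    p = p[p > 0]                       # avoid 0 * log(0) = nan
    return np.exp(-np.sum(p * np.log(p)))
```

With all distances equal the distribution is uniform and the perplexity equals N; as β grows the distribution concentrates on the closest neighbors and the perplexity falls toward 1.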

  8. Entropic affinities: example
  [Figure: entropic affinities computed on an example dataset.]

  9. Entropic affinities: properties
  H(x_n, β_n) ≡ −Σ_{m=1}^N p_m(x_n, β_n) log p_m(x_n, β_n) = log K.
  • This is a root-finding problem, or a 1D inversion problem: β_n = H_{x_n}^{−1}(log K).
  • It should be solved for each x_n ∈ {x_1, …, x_N}.
  • We can prove that:
  ‣ the root-finding problem is well defined for a Gaussian kernel: it has a unique root β_n > 0 for any K ∈ (0, N);
  ‣ the inverse is a uniquely defined, continuously differentiable function for all x_n ∈ R^D and K ∈ (0, N).
  [Figure: H(x, β) as a function of log β, crossing log K (here K = 30) at the unique root β*.]
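The uniqueness claim can be checked numerically: for a Gaussian kernel, H(x, β) decreases strictly from log N (as β → 0) toward 0 (as β → ∞), so it crosses log K exactly once for any K ∈ (0, N). A small sketch (the `entropy` helper and the random data are illustrative assumptions):

```python
import numpy as np

def entropy(d2, beta):
    """Entropy H(x, beta) of the Gaussian neighbor distribution p_n ~ exp(-beta*d2)."""
    logits = -beta * d2
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p = p[p > 0]                        # drop underflowed terms (0 * log 0 -> 0)
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
d2 = rng.uniform(0.1, 4.0, size=50)     # squared distances for one point, N = 50
betas = np.logspace(-3, 3, 200)
H = np.array([entropy(d2, b) for b in betas])

assert np.all(np.diff(H) < 0)           # H is strictly decreasing in beta
assert abs(H[0] - np.log(50)) < 1e-2    # beta -> 0 recovers the uniform entropy log N
```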

  10. Entropic affinities: bounds
  The bounds β_n ∈ [β_L, β_U] hold for every x_n ∈ R^D and K ∈ (0, N):
      β_L = max( (N/(N−1)) · log(N/K)/Δ²_N , sqrt( log(N/K)/(d⁴_N − d⁴_1) ) ),
      β_U = (1/Δ²_2) · log( (N−1) p₁/(1−p₁) ),
  where Δ²_2 = d²_2 − d²_1, Δ²_N = d²_N − d²_1 (with d_1 ≤ … ≤ d_N the sorted distances from x_n), and p₁ is the unique solution of the equation
      2(1 − p₁) log( N/(2(1−p₁)) ) = log( min(√(2N), K) ).
  The bounds are computed in O(1) for each point.
  [Figure: H(x, β) vs. log β, with the root β* bracketed by β_L and β_U.]

  11. Entropic affinities: computation
  For every x_n ∈ {x_1, …, x_N}, solve H(x_n, β_n) = log K:
  1. Initialize β_n as close to the root as possible.
  2. Compute the root β_n.

  12. 1. Computation of β_n: the root-finding

  Method    | Type             | Convergence order | Derivatives | O(N) evaluations
  Bisection | derivative-free  | linear            | 0           | 1
  Brent     | derivative-free  | linear            | 0           | 1
  Ridder    | derivative-free  | quadratic         | 0           | 2
  Newton    | derivative-based | quadratic         | 1           | 2
  Halley    | derivative-based | cubic             | 2           | 3
  Euler     | derivative-based | cubic             | 2           | 3

  • Each evaluation of the objective function and of each derivative costs O(N).
  • The derivative-free methods above generally converge globally: they work by iteratively shrinking an interval that brackets the root.
  • The derivative-based methods have a higher convergence order, but may diverge.

  13. Robustified root-finding algorithm
  • We embed the derivative-based algorithm into a bisection loop for global convergence.
  • We run the following algorithm for each x_n ∈ {x_1, …, x_N}:

  Input: initial β, perplexity K, distances d²_1, …, d²_N, bounds B.
  while true do
      for k = 1 to maxit do
          compute β using a derivative-based method
          if tolerance achieved, return β
          if β ∉ B, exit the for loop
          update B
      end for
      compute β using bisection iterations
      update B
  end while

  [Figure: H(β) vs. log β with the target log K and the bracketing bounds.]
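A minimal sketch of this loop in Python (assuming the Gaussian kernel, and using the identity dH/dβ = −β · Var_p[d²] for the Newton step; an illustration, not the authors' implementation):

```python
import numpy as np

def entropy_and_grad(d2, beta):
    """H(x, beta) and dH/dbeta for the Gaussian neighbor distribution."""
    logits = -beta * d2
    p = np.exp(logits - logits.max())
    p /= p.sum()
    nz = p > 0
    H = -np.sum(p[nz] * np.log(p[nz]))
    Ed2 = p @ d2
    dH = -beta * (p @ (d2 - Ed2) ** 2)   # dH/dbeta = -beta * Var_p[d^2]
    return H, dH

def solve_beta(d2, K, beta0=1.0, lo=1e-10, hi=1e10, tol=1e-10, maxit=20):
    """Find beta with H(x, beta) = log K: Newton steps inside a bisection bracket."""
    target = np.log(K)
    beta = beta0
    while True:
        for _ in range(maxit):
            H, dH = entropy_and_grad(d2, beta)
            if abs(H - target) < tol:
                return beta
            # H decreases in beta, so the root lies right of beta iff H > target.
            if H > target:
                lo = beta
            else:
                hi = beta
            step = beta - (H - target) / dH  # Newton step
            if not (lo < step < hi):
                break                        # step left the bracket
            beta = step
        beta = 0.5 * (lo + hi)               # fall back to one bisection step
```

For example, `solve_beta(np.arange(1.0, 21.0), K=5.0)` returns a precision whose distribution has about 5 effective neighbors.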

  14. Robustified root-finding algorithm [animation frame: the Newton step falls outside the brackets, so a bisection step is taken].
  15. Robustified root-finding algorithm [animation frame: a normal Newton step].
  16. Robustified root-finding algorithm [animation frame: another normal Newton step].
  17. Robustified root-finding algorithm [animation frame: the iterates converge to the root].

  18. 2. Initialization of β_n
  1. Simple initializations:
  • the midpoint of the bounds,
  • the distance to the k-th nearest neighbor.
  These are typically far from the root and require more iterations.
  2. Each new β_n is initialized from the solution of its predecessor:
  • sequential order;
  • tree order.
  We need to find orders that are correlated with the behavior of β.
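The two simple initializations can be sketched as follows (assuming the convention β = 1/(2σ²); both helper names are illustrative):

```python
import numpy as np

def init_from_knn(d2, k):
    """Initialize from the distance to the k-th nearest neighbor:
    take sigma ~ d_k, i.e. beta0 = 1/(2 d_k^2)."""
    dk2 = np.sort(d2)[k - 1]   # squared distance to the k-th nearest neighbor
    return 1.0 / (2.0 * dk2)

def init_midpoint(beta_L, beta_U):
    """Midpoint of the bounds [beta_L, beta_U]."""
    return 0.5 * (beta_L + beta_U)
```

Warm-starting instead from the β_n of a nearby, already-solved point (sequential or tree order) typically lands much closer to the root than either of these.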

  19. 2. Initialization of β_n [animation frame repeating the previous slide, with the initialization orders illustrated].
