Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection
A Remedy Against the Curse of Dimensionality?
Erich Schubert, Michael Gertz
Heidelberg University
October 4, 2017, Munich, Germany
t-Stochastic Neighbor Embedding

t-SNE [MH08], based on SNE [HR02], is a popular “neural network” visualization technique using stochastic gradient descent (SGD).

[Figure: 10 dimensional space vs. 2 dimensional space]

Tries to preserve the neighbors – but not the distances.
t-Stochastic Neighbor Embedding

SNE [HR02] and t-SNE [MH08] are popular “neural network” visualization techniques using stochastic gradient descent (SGD).

[Figure: example data and its SNE projection]
[Figure: example data and its t-SNE projection]

SNE/t-SNE do not preserve density / distances.
Can get stuck in a local optimum!
t-Stochastic Neighbor Embedding

SNE and t-SNE use a Gaussian kernel in the input domain:

    p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}

where each \sigma_i^2 is optimized to have the desired perplexity
(perplexity ≈ number of neighbors to preserve).

Asymmetric, so they simply use: p_{ij} := (p_{i|j} + p_{j|i}) / 2
(We suggest to prefer p_{ij} = \sqrt{p_{i|j} \cdot p_{j|i}} for outlier detection.)

In the output domain, as q_{ij}, SNE uses a Gaussian (with constant \sigma), t-SNE uses a Student-t distribution.

⇒ The Kullback-Leibler divergence can be minimized using stochastic gradient descent to make input and output affinities similar.
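For intuition, a minimal numpy sketch of this calibration follows; the names `calibrate_row` and `input_affinities` are chosen here for illustration and are not taken from the paper or any reference implementation:

```python
import numpy as np

def calibrate_row(d2, perplexity=30.0, tol=1e-5, max_iter=50):
    """Binary-search beta = 1/(2 sigma_i^2) for one point so that the entropy of
    p_{.|i} matches log(perplexity); d2 holds the squared distances to the other points."""
    lo, hi, beta = 0.0, np.inf, 1.0
    target = np.log(perplexity)
    for _ in range(max_iter):
        w = np.exp(-(d2 - d2.min()) * beta)      # shift for numerical stability (cancels below)
        p = w / w.sum()
        h = -np.sum(p * np.log(np.maximum(p, 1e-12)))   # Shannon entropy of p_{.|i}
        if abs(h - target) < tol:
            break
        if h > target:   # distribution too flat: sharpen the kernel
            lo, beta = beta, (beta * 2.0 if hi == np.inf else (beta + hi) / 2.0)
        else:            # distribution too peaked: widen the kernel
            hi, beta = beta, (lo + beta) / 2.0
    return p

def input_affinities(D2, perplexity=30.0):
    """Row-stochastic matrix of p_{j|i} from a squared distance matrix D2."""
    n = D2.shape[0]
    P = np.zeros((n, n))
    for i in range(n):
        others = np.arange(n) != i
        P[i, others] = calibrate_row(D2[i, others], perplexity)
    return P

# Symmetrization: t-SNE uses (P + P.T) / (2n); for outlier detection the slide
# suggests the geometric mean instead, e.g. P_sym = np.sqrt(P * P.T)
```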
SNE vs. t-SNE

Gaussian weights in the output domain as used by SNE vs. t-SNE:

[Figure: Gaussian weight (σ² = 1) vs. Student-t weight (t = 1) as a function of distance]

t-SNE puts more emphasis on separating points:
⇒ even neighbors will be “fanned out” a bit
⇒ “better” separation of far points (SNE has practically 0 weight on far points)
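The two weight curves from the figure can be reproduced in a few lines (an illustration only, not the paper's code; the weights are left unnormalized):

```python
import numpy as np

# Unnormalized output-domain weights as a function of the embedding distance d:
d = np.linspace(0.0, 5.0, 101)
w_sne = np.exp(-d ** 2 / 2.0)     # SNE: Gaussian with constant sigma^2 = 1
w_tsne = 1.0 / (1.0 + d ** 2)     # t-SNE: Student-t, 1 degree of freedom (heavy tail)
```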
The Curse of Dimensionality

Loss of “discrimination” of distances [Bey+99]:

    \lim_{\mathrm{dim} \to \infty} E\left[ \frac{\max_{y \neq x} d(x,y) - \min_{y \neq x} d(x,y)}{\min_{y \neq x} d(x,y)} \right] \to 0

⇒ Distances to near points and to far points become similar.

The Gaussian kernel uses relative distances:

    \exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)

(the numerator is the distance, 2\sigma_i^2 reflects the expected distance scale)

With high-dimensional data, all p_{ij} become similar!
⇒ We cannot find a “good” \sigma_i anymore.
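The concentration effect of [Bey+99] is easy to observe empirically; a quick simulation sketch (illustration only, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (3, 10, 50, 500):
    X = rng.random((1000, dim))                # uniform points in [0, 1]^dim
    d = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from one query point
    contrast = (d.max() - d.min()) / d.min()   # the quantity inside the expectation above
    print(f"dim={dim:4d}  relative contrast={contrast:.3f}")
# The relative contrast shrinks as the dimensionality grows.
```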
Distribution of Distances

On the short tail, distance distributions often look like this:

[Figure: neighbor density in 3D, 10D, and 50D data]

In high-dimensional data, almost all nearest neighbors concentrate on the right hand side of this plot.
Distribution of Distances

Gaussian weights as used by SNE / t-SNE:

[Figure: Gaussian weights for σ² = 0.3, 1, and 2]

For low-dimensional data, Gaussian weights work well.
For high-dimensional data: almost the same weight for all points.
Distribution of Distances

Gaussian kernels adjusted for intrinsic dimensionality:

[Figure: adjusted Gaussian weights for id = 3, 10, and 50, σ² = 1]

In theory, they behave like Gaussian kernels in low dimensionality.
Distance Power Transform

Let X be a random variable (“of distances”) as in [Hou15].
For constants c and m, use the transformation Y = g(X) with g(x) := c \cdot x^m.
Let F_X, F_Y be the cumulative distributions of X, Y.

Then [Hou15, Table 1]:

    ID_{F_X}(x) = m \cdot ID_{F_Y}(c \cdot x^m)

By choosing m = ID_{F_X}(x) / t for any t > 0, one therefore obtains:

    ID_{F_Y}(c \cdot x^m) = ID_{F_X}(x) / m = t

where one can choose c > 0 as desired, e.g., for numerical reasons.

⇒ We can transform distances to any desired ID = t!
Distance Power Transform

For each point p:
1. Find the k′ nearest neighbors of p (should be k′ > 100, k′ > k)
2. Estimate ID at p
3. Choose m = ID_{F_X}(x) / t, with t = 2 and c = k-distance
4. Transform distances: d′(p, q) := c \cdot d(p, q)^m
5. Use Gaussian kernel, perplexity, t-SNE, ... (steps 1–4 are sketched in code below)

Can we defeat the curse this easily?
Probably not: this is a hack to cure one symptom.
Question: is our definition of ID too permissive?
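A per-point sketch of steps 1–4, assuming the Hill-type maximum-likelihood ID estimator (the paper considers ID estimation following [Hou15]; the concrete estimator and the function names `mle_id` and `power_transform` are choices made here for illustration):

```python
import numpy as np

def mle_id(dists):
    """Hill-type maximum-likelihood estimate of the local intrinsic dimensionality
    from a point's distances to its k' nearest neighbors (all assumed > 0)."""
    d = np.sort(np.asarray(dists, dtype=float))
    w = d[-1]                                    # the k'-distance
    return -1.0 / np.mean(np.log(d[:-1] / w))    # exclude the last term (log 1 = 0)

def power_transform(dists, k, t=2.0):
    """Transform one point's neighbor distances d -> c * d^m so that their
    local ID becomes approximately t (the slide uses t = 2); assumes len(dists) >= k."""
    m = mle_id(dists) / t
    c = np.sort(dists)[k - 1]                    # c = k-distance, as on the slide; any c > 0 works
    return c * np.asarray(dists) ** m
```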
Experimental Results: it-SNE

Projections of the ALOI outlier data set (as available at [Cam+16]):

[Figure: PCA vs. t-SNE vs. it-SNE projections]

Data set: color histograms of 50,000 photos of 1000 objects
Each class: same object, different angles & different light
Labeled outliers: classes reduced to 1-3 objects
May contain other “true” outliers!
Experimental Results: it-SNE

Projection of the ALOI outlier data set with t-SNE:

[Figure: t-SNE projection of ALOI]
Experimental Results: it-SNE

Projection of the ALOI outlier data set with it-SNE:

[Figure: it-SNE projection of ALOI]

Labeled & unlabeled outliers!
Experimental Results: it-SNE

On the well-known MNIST data set, t-SNE:

[Figure: t-SNE projection of MNIST]
Experimental Results: it-SNE

On the well-known MNIST data set, it-SNE:

[Figure: it-SNE projection of MNIST]

Outliers!
Outlier Detection: ODIN

ODIN (Outlier Detection using Indegree Number) [HKF04]:
1. Find the k nearest neighbors of each object.
2. Count how often each object was returned.
   = in-degree of the k nearest neighbor graph
3. Objects with no (or fewest) occurrences are outliers (sketched in code below).

Works, but many objects will have the exact same score.
Which k to use? Scores can change abruptly with k.
Can we make a continuous (“smooth”) version of this idea?
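A minimal sketch of the ODIN in-degree score, using scikit-learn for the kNN search (`odin_scores` is a name chosen here, not from the paper):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def odin_scores(X, k=10):
    """ODIN: in-degree of each object in the kNN graph; a low in-degree indicates an outlier."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because the query point itself is returned
    _, idx = nn.kneighbors(X)
    indegree = np.zeros(len(X), dtype=int)
    for neighbors in idx[:, 1:]:                      # drop the self-match in column 0
        indegree[neighbors] += 1
    return indegree                                   # smaller = more outlying
```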
Outlier Detection: SOS

SOS (Stochastic Outlier Selection) [JPH13]
Idea: assume every object can link to one neighbor randomly.
Inliers: likely to be linked to; outliers: likely to be not linked to.

1. Compute p_{j|i} of SNE / t-SNE for all i, j:

       p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}

   (use Gaussian weights to prefer near neighbors)

2. The SOS outlier score is then:

       SOS(x_j) := \prod_{i \neq j} (1 - p_{j|i})

   = probability that no neighbor links to object j (sketched in code below).
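Reusing the `input_affinities` sketch from the earlier slide, a hypothetical SOS helper can be written in a few lines (not the reference implementation; a high score means that no other object is likely to link to x_j):

```python
import numpy as np

def sos_scores(X, perplexity=30.0):
    """SOS score: probability that no other object links to x_j."""
    D2 = np.square(X[:, None, :] - X[None, :, :]).sum(axis=-1)   # pairwise squared distances
    P = input_affinities(D2, perplexity)                          # p_{j|i}; sketched earlier
    return np.prod(1.0 - P, axis=0)                               # product over i != j (P[j, j] = 0 is harmless)
```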
KNNSOS and ISOS Outlier Detection

We propose two variants of this idea:

1. Since most p_{j|i} will be zero, use only the k nearest neighbors.
   Reduces the runtime from O(n^2) to possibly O(n \log n) or O(n^{4/3}).

       KNNSOS(x_j) := \prod_{i \in kNN(x_j)} (1 - p_{j|i})

2. Estimate ID(x_i), and use transformed distances for p_{j|i}.
   ISOS: Intrinsic-dimensionality Stochastic Outlier Selection
   (a combined sketch of both variants follows below)

Note: the t-SNE author, van der Maaten, already proposed an approximate and index-based variant of t-SNE, Barnes-Hut t-SNE, which also uses the kNN only [Maa14].
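A combined sketch under stated assumptions, reusing `calibrate_row` and `power_transform` from the earlier slides: each point i contributes a factor (1 - p_{j|i}) to every neighbor j it links to, which coincides with the product on the slide when neighborhoods are (approximately) symmetric; the function name and parameter defaults are illustrative, not from the paper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knnsos_scores(X, k=100, perplexity=30.0, intrinsic=False, t=2.0):
    """KNNSOS: SOS restricted to the k nearest neighbors of each point.
    With intrinsic=True, neighbor distances are power-transformed first (ISOS-style)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]              # drop the self-match in column 0
    scores = np.ones(len(X))
    for i in range(len(X)):
        d = dist[i]
        if intrinsic:
            d = power_transform(d, k=k, t=t)          # ISOS: rescale to intrinsic dimensionality t
        p = calibrate_row(d ** 2, perplexity)         # p_{j|i} over the kNN of x_i only
        # accumulate prod (1 - p_{j|i}) into the score of each neighbor j of x_i
        scores[idx[i]] *= 1.0 - p
    return scores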