Learning the Density Structure of High-Dimensional Data



  1. Learning the Density Structure of High-Dimensional Data Yoshua Bengio Work done with Martin Monperrus

  2. Real Goals of Statistical Learning
  • Given a set D of l examples x_t coming from an unknown distribution or process.
  • Discover structure in that distribution (= departures from uniformity and independence) so as to be able to make predictions about new combinations of values.
  • Roughly: where are the zones of high density vs. low density?
  • Generalization: inference must work on new examples from the same distribution.
  • With high-dimensional data, new examples tend to be "far" from the training data.

  3. Spectral Embedding Algorithms
  Algorithms that estimate a training-set embedding on the presumed data manifold from the principal eigenvectors of a Gram matrix M with M_ij = K_D(x_i, x_j), for a data-dependent kernel K_D.
  • Examples: LLE (Roweis & Saul 2000), Isomap (Tenenbaum et al. 2000), Laplacian Eigenmaps (Belkin & Niyogi 2003), spectral clustering (Weiss 1999), kernel PCA (Schölkopf et al. 1998). Each corresponds to a different K_D. (Figure from Roweis & Saul.)
  • Attractiveness: they represent non-linear manifolds with an analytic solution.
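To make the shared recipe concrete, here is a minimal numpy sketch of the generic pattern: build the Gram matrix M_ij = K_D(x_i, x_j), take its principal eigenpairs, and read off the training-set embedding. A plain Gaussian kernel stands in for K_D (each algorithm above defines its own data-dependent kernel), and normalization/centering steps such as kernel PCA's are omitted.

```python
import numpy as np

def spectral_embedding(X, n_components=2, sigma=1.0):
    """Sketch: embedding of the training set from the principal
    eigenvectors of a Gram matrix M with M[i, j] = K_D(x_i, x_j).
    A Gaussian kernel stands in for the data-dependent kernel K_D."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T      # pairwise squared distances
    M = np.exp(-d2 / (2.0 * sigma**2))                   # Gram matrix
    eigvals, eigvecs = np.linalg.eigh(M)                 # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1][:n_components]     # keep the principal eigenpairs
    lam, V = eigvals[order], eigvecs[:, order]
    # k-th embedding coordinate of training point x_i: sqrt(lam_k) * V[i, k]
    return V * np.sqrt(np.maximum(lam, 0.0)), lam, V

X = np.random.randn(200, 10)                             # toy data
E, lam, V = spectral_embedding(X)
```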

  4. Out-of-Sample Embedding = Induction
  • How to generalize to new examples without recomputing eigenvectors?
  • Are there corresponding induction algorithms?
  • Out-of-sample generalization with the Nyström formula, for the k-th coordinate:
      e_k(x) = (1/λ_k) Σ_{i=1}^{n} v_{ki} K_D(x, x_i)
    with (λ_k, v_k) the k-th eigenpair of M.
  • This is an estimator of the eigenfunctions of K_D as |D| → ∞ (see upcoming Neural Computation paper, on my web page).
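A sketch of the Nyström formula, reusing the eigenpairs (lam, V) and the Gaussian stand-in kernel from the previous sketch; scaling conventions vary across papers, so treat this as illustrative rather than the exact estimator from the paper.

```python
import numpy as np

def nystrom_embedding(x, X_train, lam, V, kernel):
    """Out-of-sample coordinates via the Nystrom formula:
    e_k(x) = (1/lam_k) * sum_i V[i, k] * K_D(x, x_i)."""
    k_vec = np.array([kernel(x, x_i) for x_i in X_train])   # K_D(x, x_i) for all training points
    return (V.T @ k_vec) / lam                               # one value per retained eigenpair

# Usage with the previous sketch (Gaussian kernel assumed, sigma = 1.0):
gauss = lambda a, b: np.exp(-np.sum((a - b)**2) / 2.0)
e_new = nystrom_embedding(np.random.randn(10), X, lam, V, gauss)
```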

  5. Tangent Plane ⇔ Embedding Function
  [Figure: data on a curved manifold, with the tangent plane and tangent directions at a point.]
  Important observation: the tangent plane at x is simply the subspace spanned by the gradient vectors of the embedding function, ∂e_k(x)/∂x.
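This observation can be checked numerically: the estimated tangent basis at x is just the Jacobian of the embedding function. A minimal finite-difference sketch, assuming an embedding function such as the Nyström estimator above:

```python
import numpy as np

def tangent_basis(x, embed, eps=1e-4):
    """Finite-difference Jacobian of the embedding function e(x):
    row k holds the gradient of e_k at x, and together the rows
    span the estimated tangent plane at x."""
    e0 = np.asarray(embed(x))
    J = np.zeros((len(e0), len(x)))
    for i in range(len(x)):
        xp = x.copy()
        xp[i] += eps
        J[:, i] = (np.asarray(embed(xp)) - e0) / eps    # d e_k(x) / d x_i
    return J

# Usage with the Nystrom sketch above:
embed = lambda x: nystrom_embedding(x, X, lam, V, gauss)
T = tangent_basis(np.random.randn(10), embed)           # shape: (n_components, input dim)
```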

  6. Local Manifold Learning
  • Local manifold learning algorithms derive information about the manifold structure near x using mostly the neighbors of x.
  • For LLE, kernel PCA with a Gaussian kernel, spectral clustering, and Laplacian Eigenmaps, K_D(x, y) → 0 for x far from y, so e_k(x) depends only on the neighbors of x.
  • Therefore the tangent plane ∂e_k(x)/∂x also depends only on the neighbors of x.
  • ⇒ We can't say anything about the manifold structure near a new example x that is "far" from the training examples!

  7. LLE: Local Affine Structure
  The LLE algorithm estimates the local coordinates of each example in the basis of its nearest neighbors, then looks for a low-dimensional coordinate system that has about the same expansion.
  Variations on the local plane around point x_i are written
      Δx = Σ_{x_j ∈ N(x_i)} α_j d_ij
  where the d_ij = (x_i − x_j) are local "tangent directions", learned separately for each zone around a point x_i.
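A sketch of LLE's first step for a single point: express x_i in the basis of its nearest neighbors by solving a small regularized linear system for the standard sum-to-one reconstruction weights.

```python
import numpy as np

def lle_local_weights(X, i, n_neighbors=5, reg=1e-3):
    """LLE step 1 sketch: reconstruction weights of x_i over its nearest
    neighbors, using the local directions d_ij = x_i - x_j."""
    dists = np.linalg.norm(X - X[i], axis=1)
    nbrs = np.argsort(dists)[1:n_neighbors + 1]          # nearest neighbors, excluding x_i itself
    D = X[i] - X[nbrs]                                   # rows are d_ij = x_i - x_j
    G = D @ D.T                                          # local Gram matrix
    G = G + reg * np.trace(G) * np.eye(len(nbrs))        # regularize (nearly singular when d < k)
    w = np.linalg.solve(G, np.ones(len(nbrs)))
    return nbrs, w / w.sum()                             # constrain the weights to sum to 1

nbrs, w = lle_local_weights(np.random.randn(100, 3), i=0)
```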

  8. ISOMAP
  Isomap estimates the geodesic distance along the manifold using shortest paths in the nearest-neighbor graph. It then looks for a low-dimensional representation that approximates those geodesic distances in the least-squares sense (MDS).
  Lemma: the tangent plane at x of the manifold estimated by Isomap is included in the span of the vectors x − x_j, where the x_j are training-set neighbors of x (in the sense of being the first neighbor on the path from x to one of the training examples).
  Isomap is also a local manifold learning algorithm!
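A compact sketch of the Isomap pipeline: geodesic distances via shortest paths in the k-nearest-neighbor graph, followed by classical MDS on those distances (scipy assumed for the graph search; the neighbor graph is assumed connected).

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap(X, n_neighbors=5, n_components=2):
    """Isomap sketch: shortest-path geodesic distances, then classical MDS."""
    D = squareform(pdist(X))
    W = np.full_like(D, np.inf)                      # inf = no edge in the dense graph
    for i in range(len(X)):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        W[i, nbrs] = D[i, nbrs]
    W = np.minimum(W, W.T)                           # symmetrize the neighbor graph
    G = shortest_path(W, method="D")                 # geodesic distance estimates (Dijkstra)
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix for classical MDS
    B = -0.5 * J @ (G**2) @ J
    lam, V = np.linalg.eigh(B)
    idx = np.argsort(lam)[::-1][:n_components]
    return V[:, idx] * np.sqrt(np.maximum(lam[idx], 0.0))

Y = isomap(np.random.randn(200, 5))
```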

  9. Pancake Mixture Models
  Other local manifold learning algorithms: density mixture models of flattened ("pancake") Gaussians.
  • Mixtures of factor analyzers (Ghahramani & Hinton 1996)
  • Mixtures of probabilistic PCA (Tipping & Bishop 1999)
  • Manifold Parzen Windows (Vincent & Bengio 2003)
  • Automatic Alignment of Local Representations (Teh & Roweis 2003)
  • Manifold Charting (Brand 2003)
  Some provide both a density and an embedding.

  10. Local Manifold Learning: Local Linear Patches
  Current manifold learning algorithms cannot handle highly curved manifolds because they are based on locally estimated, locally linear patches.
  [Figure: a high-contrast image and a shifted version of it, with the corresponding tangent images and tangent directions.]

  11. Fundamental Problems with Local Manifold Learning
  • High noise: constraints are not perfectly satisfied; the data are not strictly on the manifold. More noise → more data needed per local patch.
  • High curvature: need more, smaller patches, on the order of O((1/r)^d), with the patch radius r decreasing with curvature.
  • High manifold dimension: O((1/r)^d) patches are needed (curse of dimensionality), with at least O(d) examples per patch (more as the noise grows).
  • Many manifolds: e.g., images of transformed object instances = one manifold per instance or per object class. Local manifold learning can't take advantage of shared structure across multiple manifolds.

  12. Non-Local Tangent Plane Predictors
  Proposed approach: estimate the tangent plane basis vectors as a function of the position x in input space, with a flexibly parametrized, matrix-valued d × n function F(x). Train F(x) to approximately span the differences between x and its neighbors.
  Experiments: estimate F with a simple neural network. Training criterion = relative projection error at the examples x_t and their neighbors x_j:
      min_{F, {w_tj}} Σ_t Σ_{j ∈ N(x_t)} ||F'(x_t) w_tj − (x_t − x_j)||² / ||x_t − x_j||²
  Double optimization: given F, there is an analytic solution for each vector w_tj, so one can easily do stochastic gradient descent on F's parameters.
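A minimal PyTorch sketch of this double optimization, under assumptions the slide does not specify (a tiny MLP for F, toy dimensions): for each example, the projection coefficients w_tj are obtained analytically by least squares given the current F, and only F's parameters are updated by stochastic gradient descent.

```python
import torch

n, d, hidden = 10, 2, 32                              # input dim, tangent dim, hidden units (assumed)
F_net = torch.nn.Sequential(
    torch.nn.Linear(n, hidden), torch.nn.Tanh(),
    torch.nn.Linear(hidden, d * n),                   # outputs F(x) as a flattened d x n matrix
)
opt = torch.optim.SGD(F_net.parameters(), lr=1e-2)

def relative_projection_error(x_t, neighbors):
    """sum_j ||F'(x_t) w_tj - (x_t - x_j)||^2 / ||x_t - x_j||^2,
    with w_tj the analytic least-squares solution given the current F."""
    F = F_net(x_t).reshape(d, n)                      # rows = predicted tangent basis at x_t
    loss = torch.zeros(())
    for x_j in neighbors:
        diff = x_t - x_j
        # analytic w_tj given the current F (no gradient through the solve itself)
        w = torch.linalg.lstsq(F.T.detach(), diff.unsqueeze(1)).solution.squeeze(1)
        loss = loss + torch.sum((F.T @ w - diff) ** 2) / torch.sum(diff ** 2)
    return loss

# one stochastic gradient step on a toy example and its neighbors
x_t = torch.randn(n)
neighbors = [x_t + 0.1 * torch.randn(n) for _ in range(4)]
opt.zero_grad()
relative_projection_error(x_t, neighbors).backward()
opt.step()
```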

  13. Results with Tangent Plane Predictors
  Task 1: 2-D data with 1-D sinusoidal manifolds. The method indeed captures the tangent planes.
  [Figure "Generalization of Tangent Learning": small blue segments are the estimated tangent planes; red points are training examples.]

  14. Results with Tangent Plane Predictors
  Task 2: 41-dimensional Gaussian curves x(i) = e^{t_1 − (−2 + i/10)²/t_2}, parametrized by the two coordinates t_1 and t_2.
  [Figure: relative projection error on the k-th nearest neighbor, for k from 1 to 5, for the four compared methods (Analytic, Tangent Learning, DimNN, Local PCA).]

  15. Results with Tangent Plane Predictors
  Task 3: 1000 digit images plus a rotated version of each = 2 examples per manifold. Images are 14 × 14, covering the 10 digits of the MNIST database.
  Testing on MNIST digits, average relative projection error:
      analytic tangent plane   0.27
      tangent learning         0.43
      Dim-NN or Local PCA      1.50
  [Figure: tangent vector predicted on a test image of an 8: original image, analytic tangent plane, tangent learning, local PCA.]

  16. Truly Out-of-Sample Generalization
  The model was trained on digits 0 to 9; test it on the letter M and compare the predicted tangent vectors.
  [Figure: the test image of the letter M with the tangent vectors predicted by tangent learning and by local PCA.]
  Not surprisingly, local manifold learning fails, whereas the globally estimated tangent plane predictor generalizes to a very different image!

  17. Conclusions
  • Amazing progress in unsupervised learning over the last few years: non-linear manifolds can be learned, with easy-to-optimize convex criteria.
  • These methods can be extended to embedding-function induction → generalization.
  • Unfortunately, they estimate manifold tangents from purely local information, which is very sensitive to four problems: noise, curvature, dimensionality, and multiple disjoint manifolds.
  • N.B.: the same problem arises with non-parametric semi-supervised learning!
  • Proposed solution: learn a globally estimated tangent plane predictor function.
  • Works superbly in all three experimental setups tested. NOT CONVEX ANYMORE, BUT IT WORKS.

  18. Future Work
  • The proposed algorithm estimates the principal directions of a Gaussian covariance everywhere!
  • Using existing algorithms (Brand 2003; Teh & Roweis 2003), the predicted Gaussian covariances at the centers x_i can be converted into (1) a Gaussian mixture density function (globally estimated!) and (2) a globally coherent embedding.
  • Exotic extension: an uncountable Gaussian mixture. Follow a random walk that moves x to x + Δx, with Δx sampled from p(x + Δx | x), given by the local covariance at x. The density is the normalized eigenfunction p(x) solving
      ∫ p(x) p(y | x) dx = p(y).
    It can be estimated by solving a finite linear system built from the data plus random walk samples x_t, yielding a solution of the form p(x) = Σ_t α_t p(x | x_t).
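A heavily hedged numpy sketch of this last idea, discretizing the eigen-equation on the sample points with isotropic Gaussians standing in for the predicted local covariances (the slide's version would use the covariance predicted at each x): the principal eigenvector of the matrix K[s, t] = p(x_s | x_t) supplies the mixture weights α_t in p(x) = Σ_t α_t p(x | x_t).

```python
import numpy as np

def eigen_density(X, sigma=0.5):
    """Sketch: density as the (normalized) principal eigenfunction of the
    random-walk transition operator, represented as a finite Gaussian mixture
    p(x) = sum_t alpha_t p(x | x_t) over the sample points."""
    n, dim = X.shape
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    norm = (2.0 * np.pi * sigma**2) ** (dim / 2)
    K = np.exp(-d2 / (2.0 * sigma**2)) / norm          # K[s, t] = p(x_s | x_t)
    eigvals, eigvecs = np.linalg.eigh(K)               # K is symmetric for isotropic Gaussians
    alpha = np.abs(eigvecs[:, np.argmax(eigvals)])     # principal eigenvector, made nonnegative
    alpha /= alpha.sum()                               # so the mixture integrates to 1

    def density(x):
        d2x = np.sum((X - x) ** 2, axis=1)
        return float(alpha @ np.exp(-d2x / (2.0 * sigma**2))) / norm
    return density

p = eigen_density(np.random.randn(300, 2))
print(p(np.zeros(2)))                                   # density estimate at the origin
```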
