topology and data

topology and data: topological data analysis and manifold learning

by Joshua Tan, for Ufora & NYU Capstone, 12/16/2014. A library for topology and data: topological data analysis and manifold learning.


  1. topology and data: a library for topological data analysis and manifold learning. By Joshua Tan, for Ufora & NYU Capstone, 12/16/2014.

  2. what is… dimensionality reduction
     Given some input space X and a sample set S, dimensionality reduction seeks to find a lower-dimensional manifold M s.t. S ⊂ M ⊂ X.
     Also known as manifold learning.
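A minimal sketch of the simplest case, where M is a line in X = R^3 and is recovered by linear PCA. This is an illustration only, not taken from the slides: the function names, the noise level, and the use of power iteration are all assumptions, and because of the noise S lies only approximately on M.

```python
import math
import random

def top_principal_component(points, iterations=200):
    """Return (mean, unit direction) of the first principal component.

    Uses power iteration on the sample covariance matrix (an assumed,
    dependency-free stand-in for a full eigendecomposition).
    """
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(points, mean, direction):
    """1-d coordinate of each point along the principal direction."""
    return [sum((p[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for p in points]

random.seed(0)
true_dir = (1 / 3, 2 / 3, 2 / 3)  # unit vector spanning the hidden line M
params = [random.uniform(-1, 1) for _ in range(200)]
sample = [[t * u + random.gauss(0, 0.01) for u in true_dir] for t in params]

mean, direction = top_principal_component(sample)
coords = project(sample, mean, direction)  # the low-dimensional representation
```

Here `coords` plays the role of the low-dimensional description of S; nonlinear methods replace the line with a curved manifold.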

  3. examples
     ❖ Kernel PCA maps the data up into a feature space, projects down onto the principal components, and ranks them by eigenvalue
     ❖ Isomap (which builds on MDS) embeds high-dimensional points in a low-dimensional space while preserving a dissimilarity (distance) matrix, here of geodesic distances estimated along a neighborhood graph
     ❖ Projection pursuit projects to the most “interesting” components according to some objective function
     ❖ DBSCAN, a clustering algorithm which considers not only distances but also “density-reachability” from a cluster
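To make “density-reachability” concrete, here is a minimal DBSCAN sketch with brute-force neighbor search. The parameter names `eps` (distance) and `min_pts` (density) follow common usage but are assumptions here, and the code is an illustration rather than a tuned implementation.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is core, so keep expanding from it
    return labels

points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
          (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1),
          (10, 0)]
labels = dbscan(points, eps=0.2, min_pts=3)
# labels == [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two tight groups become clusters 0 and 1, and the isolated point is labelled noise.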

  4. Mapper
     ❖ Like DBSCAN, Mapper is a clustering/dimensionality reduction algorithm based on varying both a distance parameter and a “density” parameter
     ❖ Unlike DBSCAN, Mapper is designed to be less dependent on the choice of parameters

  5. example: breast cancer (from Nicolau et al. 2011)

  6. computing Mapper
     1. generate a sample data set as a DataFrame object
     2. compute a 1-d dissimilarity matrix of distances
     3. evaluate the points using a k-nearest-neighbor (kNN) filter function
     4. define a covering of the resulting image
     5. use the pre-image of this covering to define a covering of the original data
     6. from the covering, generate a clustering of the data
     7. visualize the result as a graph
     For more complicated filter functions f : X → R^2, the output is in general a higher-dimensional simplicial complex rather than just a graph.
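The steps above can be sketched in plain Python as follows. This is not the Fora implementation from the project: every function name is hypothetical, the cover is assumed to consist of equal-length overlapping intervals, and the clustering is assumed to be single linkage at a fixed threshold `eps`. Visualization (step 7) is left out; the sketch returns the nodes and edges of the graph.

```python
import math
from itertools import combinations

def distance_matrix(points):
    """Step 2: full pairwise distance matrix."""
    return [[math.dist(p, q) for q in points] for p in points]

def knn_filter(dist, k):
    """Step 3: filter value = distance to the k-th nearest neighbor."""
    return [sorted(row)[k] for row in dist]  # sorted(row)[0] is the point itself

def interval_cover(values, n_intervals, overlap):
    """Step 4: cover the image of the filter with overlapping intervals."""
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals
    pad = length * overlap
    return [(lo + i * length - pad, lo + (i + 1) * length + pad)
            for i in range(n_intervals)]

def single_linkage(indices, dist, eps):
    """Step 6: connected components of the 'closer than eps' graph."""
    remaining, clusters = set(indices), []
    while remaining:
        seed = remaining.pop()
        component, stack = {seed}, [seed]
        while stack:
            i = stack.pop()
            close = {j for j in remaining if dist[i][j] <= eps}
            remaining -= close
            component |= close
            stack.extend(close)
        clusters.append(frozenset(component))
    return clusters

def mapper(points, k, n_intervals, overlap, eps):
    dist = distance_matrix(points)
    values = knn_filter(dist, k)
    nodes = []
    for a, b in interval_cover(values, n_intervals, overlap):
        # Step 5: pre-image of one cover element, then cluster it.
        preimage = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(single_linkage(preimage, dist, eps))
    # Two clusters are joined by an edge when they share a data point.
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges

# Step 1 (stand-in): points on a line with growing gaps, so density varies.
pts = [(i * i / 10.0, 0.0) for i in range(12)]
nodes, edges = mapper(pts, k=2, n_intervals=3, overlap=0.25, eps=1.5)
```

Each node is a set of point indices, and the overlap between cover elements is what produces shared points and hence edges.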

  7. “connecting” the dots (figures borrowed from Michael Lesnick, on IAS eNews)

  8. persistent homology
     ❖ Persistent homology is a technique (read: a technical tool) for computing the “shape” of data sets
     ❖ In some sense, the global counterpart to Mapper

  9. computing persistent homology
     ❖ Take your point cloud S and turn it into a nested sequence of simplicial complexes, a.k.a. a filtration.
     ❖ Zomorodian and Carlsson (2004) specify a natural algorithm for computing the persistent homology of a filtered d-dimensional simplicial complex K, assuming we evaluate the homology over a field.
     ❖ This returns a “persistence barcode”.
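As a hedged illustration of what a barcode looks like, the sketch below computes only the dimension-0 bars (connected components) of a Vietoris-Rips filtration using union-find. It is not the general Zomorodian and Carlsson algorithm, which handles all dimensions by matrix reduction; the function name and the convention that an edge enters the filtration at scale equal to its length are assumptions.

```python
import math
from itertools import combinations

def h0_barcode(points):
    """Return the H0 bars (birth, death) of a Vietoris-Rips filtration."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edges enter the filtration in order of increasing length.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, length))  # a component dies when it merges
    bars.append((0.0, math.inf))  # one component survives at every scale
    return bars

bars = h0_barcode([(0, 0), (1, 0), (10, 0)])
# bars == [(0.0, 1.0), (0.0, 9.0), (0.0, inf)]
```

The long bar ending at 9.0 reflects the well-separated third point: long bars indicate features that persist across scales, short bars indicate noise.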

  10. example: natural image statistics
      Data from Mumford et al.: 4167 images, randomly sample 5000 3 pixel by 3 pixel images from each image. Take the ones with highest contrast, obtain 8,000,000 points in R^9.
      Normalize w.r.t. mean intensity, project onto high-contrast images (those away from the origin). Obtain points on S^7.
      M[k,T] is the subset of M in the upper T percent of density as measured by δ_k (the distance to the k-th nearest neighbor).
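A small sketch of the density cut M[k,T], assuming brute-force neighbor search and hypothetical function names. A small δ_k means a point sits in a dense region, so the upper T percent of density corresponds to the T percent of points with the smallest δ_k.

```python
import math

def knn_distance(points, k):
    """delta_k for each point: distance to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

def densest_subset(points, k, T):
    """Keep the T percent of points with the smallest delta_k."""
    delta = knn_distance(points, k)
    keep = max(1, math.ceil(len(points) * T / 100))
    order = sorted(range(len(points)), key=lambda i: delta[i])
    # Return the kept points in their original order.
    return [points[i] for i in sorted(order[:keep])]

cloud = [(0, 0), (0.1, 0), (0.2, 0), (5, 0), (9, 0)]
dense = densest_subset(cloud, k=1, T=60)
# dense == [(0, 0), (0.1, 0), (0.2, 0)]
```

Varying k changes the scale at which density is measured: small k gives a local estimate, large k a smoother, more global one.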

  11. Ufora
      ❖ Ufora is a data analytics startup based in NYC
      ❖ For my project, I implemented both the Mapper algorithm and a persistent homology library in their proprietary language, Fora
      ❖ https://dev.ufora.com/#/projects/mapper/HEAD/mapper

  12. future directions

  13. bibliography
      ❖ Carlsson, Gunnar. “Topology and data”.
      ❖ Zomorodian, Afra. “Computing persistent homology”.
      ❖ Ghrist, Robert. “Barcodes: the persistent homology of data”.
      ❖ Singh, Gurjeet. “Topological methods for the analysis of high dimensional data sets and 3D object recognition”.
      ❖ Müllner, Daniel. Python Mapper at danifold.net/mapper
      ❖ Blum, Avrim. “Thoughts on clustering”.
