topology and data: a library for topological data analysis and manifold learning
by Joshua Tan, for Ufora & NYU Capstone, 12/16/2014
what is… dimensionality reduction
Given some input space X and a sample set S, dimensionality reduction seeks to find a lower-dimensional manifold M s.t. S ⊂ M ⊂ X.
Also known as manifold learning.
examples
❖ Kernel PCA projects up into the feature space, projects down onto the components, ranks by eigenvalues
❖ Isomap (an extension of MDS) embeds high-d points in a low-d space while preserving a dissimilarity (distance) matrix
❖ Projection pursuit projects to the most “interesting” components according to some objective function
❖ DBSCAN, which considers not only distances but also some “density-reachability” from a cluster
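To make “density-reachability” concrete, here is a minimal pure-Python sketch of DBSCAN. The function name `dbscan`, the parameter names `eps` and `min_pts`, and the toy data are illustrative assumptions, not taken from the slides or from any particular library.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n              # None means "not yet visited"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:      # not a core point: provisionally noise
            labels[i] = -1
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:      # noise reachable from a core point: border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(j_nbrs)
    return labels

# Toy data (an assumption for illustration): two tight squares and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)   # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two parameters play the roles named in the last bullet: `eps` is the distance scale, and `min_pts` is the density threshold a point must meet before a cluster can grow through it.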
Mapper
❖ Like DBSCAN, Mapper is a clustering/dimensionality reduction algorithm based on varying both a distance parameter as well as a “density” parameter
❖ Unlike DBSCAN, Mapper is designed to be less dependent on the choice of parameters
example: breast cancer
from Nicolau et al. 2011
computing Mapper
1. generate a sample data set as a DataFrame object
2. compute a 1-d dissimilarity matrix of distances
3. evaluate the points using a k-nearest-neighbors filter function
4. define a covering of the resulting image
5. use the pre-image of this covering to define a covering of the original data
6. from the covering, generate a clustering of the data
7. visualize the result as a graph
For more complicated filter functions f : X → R^2, the generated graph will be a simplicial complex.
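The steps above can be sketched in plain Python. This is not the Fora implementation: it uses a list of tuples instead of a DataFrame and a full square distance matrix, and it assumes a real-valued filter (distance to the k-th nearest neighbor), an overlapping interval cover, and single-linkage clustering at a fixed scale `eps`. All function and parameter names here are hypothetical.

```python
import math
from itertools import combinations

def distance_matrix(points):
    # Step 2: pairwise Euclidean distances.
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def knn_filter(dist, k):
    # Step 3: filter value = distance to the k-th nearest neighbor.
    # sorted(row)[0] is the point's distance to itself (0.0).
    return [sorted(row)[k] for row in dist]

def interval_cover(values, n_intervals, overlap):
    # Step 4: equal-length intervals over the image of the filter,
    # each widened on both sides by `overlap` times its length.
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals or 1.0
    pad = length * overlap
    return [(lo + i * length - pad, lo + (i + 1) * length + pad)
            for i in range(n_intervals)]

def components(indices, dist, eps):
    # Step 6: single-linkage clusters, i.e. connected components of the
    # graph joining points at distance <= eps.
    remaining, clusters = set(indices), []
    while remaining:
        stack = [remaining.pop()]
        comp = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in remaining if dist[i][j] <= eps}
            remaining -= near
            comp |= near
            stack.extend(near)
        clusters.append(frozenset(comp))
    return clusters

def mapper(points, k, n_intervals, overlap, eps):
    dist = distance_matrix(points)
    f = knn_filter(dist, k)
    nodes = []
    for a, b in interval_cover(f, n_intervals, overlap):
        # Step 5: pre-image of one cover element, then cluster it.
        preimage = [i for i, v in enumerate(f) if a <= v <= b]
        nodes.extend(components(preimage, dist, eps))
    # Step 7: one graph node per cluster, one edge per pair of clusters
    # that share at least one data point.
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges

# Step 1: toy sample (an assumption): points on a line that thin out to the right.
points = [((i / 10) ** 2, 0.0) for i in range(30)]
nodes, edges = mapper(points, k=2, n_intervals=4, overlap=0.25, eps=0.6)
print(len(nodes), "nodes,", len(edges), "edges")
```

Because the cover elements overlap, a point can belong to clusters from two neighboring intervals, and those shared points are what produce the edges. With a filter into R^2, three or more cover elements can overlap at once, which is where the higher-dimensional simplices mentioned above would come from.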
“connecting” the dots figures borrowed from Michael Lesnick, on IAS eNews
persistent homology
❖ Persistent homology is a technique (read: a technical tool) for computing the “shape” of data sets
❖ In some sense, the global counterpart to Mapper
computing persistent homology
❖ Take your point cloud S and turn it into a nested sequence of simplicial complexes, a.k.a. a filtration.
❖ Zomorodian and Carlsson (2004) specify a natural algorithm for computing the homology of a filtered d-dimensional simplicial complex K, assuming we evaluate the homology over a field.
❖ This returns a “persistent bar code”.
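As a rough illustration of the pipeline above, the sketch below builds a Vietoris-Rips filtration and reduces its boundary matrix column by column over the field Z/2. The choice of Vietoris-Rips, the Z/2 coefficients, and the names `rips_filtration` and `barcode` are assumptions made for this example; it is the textbook column reduction, not the Fora library.

```python
import math
from itertools import combinations

def rips_filtration(points, max_dim=2):
    """Simplices of the Vietoris-Rips complex with the scale at which each appears."""
    simplices = []
    for dim in range(max_dim + 1):
        for s in combinations(range(len(points)), dim + 1):
            # A simplex appears once its longest edge is present (vertices at 0).
            scale = max((math.dist(points[i], points[j])
                         for i, j in combinations(s, 2)), default=0.0)
            simplices.append((scale, dim, s))
    simplices.sort()   # by scale, then dimension, so faces precede cofaces
    return simplices

def barcode(simplices):
    """Return bars (homology dimension, birth, death) computed over Z/2."""
    index = {s: i for i, (_, _, s) in enumerate(simplices)}
    top = max(dim for _, dim, _ in simplices)
    reduced = {}       # pivot (largest row index) -> reduced column with that pivot
    creators = []      # simplices whose column reduces to zero (they create a class)
    bars = []
    for j, (scale, dim, s) in enumerate(simplices):
        # Boundary of s as a set of face indices; symmetric difference = addition mod 2.
        col = {index[f] for f in combinations(s, dim)} if dim > 0 else set()
        while col and max(col) in reduced:
            col ^= reduced[max(col)]
        if col:
            pivot = max(col)
            reduced[pivot] = col
            birth = simplices[pivot][0]
            if scale > birth:                  # skip zero-length bars
                bars.append((dim - 1, birth, scale))
        else:
            creators.append(j)
    for j in creators:
        scale, dim, _ = simplices[j]
        # Classes that are never killed live forever. Top-dimensional ones are
        # dropped because the simplices that could kill them were never built.
        if j not in reduced and dim < top:
            bars.append((dim, scale, math.inf))
    return bars

# Toy point cloud (an assumption): the corners of a unit square.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
for bar in sorted(barcode(rips_filtration(square))):
    print(bar)
```

For the square, the bar code should show four components born at scale 0, three of which die at scale 1 when the sides appear, and a single loop born at scale 1 that dies at √2 when the diagonals fill it in.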
example: natural image statistics
Data from Mumford et al.: 4167 images, randomly sample 5000 3 pixel by 3 pixel images from each image. Take the ones with highest contrast, obtain 8,000,000 points in R^9.
Normalize w.r.t. mean intensity, project onto high-contrast images (those away from the origin). Obtain points on S^7.
M[k,T] is the subset of M in the upper T percent of density as measured by δ_k (the k-nn distance).
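The normalization and the density cut M[k,T] can be sketched as follows. This is a simplified stand-in: it uses the plain Euclidean norm where the original study may use a different contrast norm, and the names `normalize_patch`, `knn_distance` and `densest_subset` are hypothetical. A 9-pixel patch with its mean removed and scaled to unit length lies on a 7-sphere inside an 8-dimensional subspace of R^9.

```python
import math

def normalize_patch(patch):
    """Subtract the mean intensity, then scale to unit Euclidean norm."""
    mean = sum(patch) / len(patch)
    centered = [v - mean for v in patch]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]   # assumes the patch is not constant

def knn_distance(points, k):
    """delta_k for every point: the distance to its k-th nearest neighbor."""
    out = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        out.append(dists[k])              # dists[0] is the point itself (0.0)
    return out

def densest_subset(points, k, T):
    """M[k, T]: the T percent of points with the smallest delta_k (highest density)."""
    delta = knn_distance(points, k)
    keep = max(1, round(len(points) * T / 100))
    order = sorted(range(len(points)), key=lambda i: delta[i])
    return [points[i] for i in sorted(order[:keep])]   # keep original ordering

# Toy data (an assumption): five points on a line with growing gaps.
cloud = [(0,), (1,), (3,), (7,), (15,)]
print(densest_subset(cloud, k=1, T=40))   # [(0,), (1,)]
```

Small values of k measure density very locally, while large k smooths it out, so M[k,T] is really a family of subsets indexed by both parameters.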
Ufora
❖ Ufora is a data analytics startup based in NYC
❖ For my project, I implemented both the Mapper algorithm and a persistent homology library in their proprietary language, Fora
❖ https://dev.ufora.com/#/projects/mapper/HEAD/mapper
future directions
bibliography
❖ Carlsson, Gunnar. “Topology and data”.
❖ Zomorodian, Afra. “Computing persistent homology”.
❖ Ghrist, Robert. “Barcodes: the persistent homology of data”.
❖ Singh, Gurjeet. “Topological methods for the analysis of high dimensional data sets and 3D object recognition”.
❖ Müllner, Daniel. Python Mapper at danifold.net/mapper
❖ Blum, Avrim. “Thoughts on clustering”.