Categorical and continuous variables
E.g. {browser=“firefox”, pages_visited < 10} → {sale_made=“no”}
Easy approach: convert each level to a binary “item” (see “dummy” codification in pre-processing) and place continuous variables in bins before converting them to binary values: {browser-firefox, pages-visited-0-to-20} → {no-sale-made}
Group categorical variables with many levels
Drop frequent levels which are not considered interesting, e.g. prevent browser-chrome from turning up in rules if 90% of visitors use this browser
Binning of continuous variables requires some fine-tuning (otherwise support or confidence becomes too low)
Better statistical methods are available, see e.g. R. Rastogi and K. Shim, “Mining optimized association rules with categorical and numeric attributes,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 1, pp. 29-50, and others 20
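A minimal sketch of this “easy approach” in pandas; the column names, bin edges and bin labels below are made-up illustrations, not values from the slides.

```python
# Bin a continuous variable and dummy-code everything into binary "items".
import pandas as pd

df = pd.DataFrame({
    "browser": ["firefox", "chrome", "firefox", "safari"],
    "pages_visited": [3, 25, 12, 48],
    "sale_made": ["no", "yes", "no", "yes"],
})

# Bin the continuous variable into categorical levels (bin edges are arbitrary here)
df["pages_visited"] = pd.cut(df["pages_visited"],
                             bins=[0, 20, 40, 60],
                             labels=["0-to-20", "20-to-40", "40-to-60"])

# One binary column ("item") per level of every variable
items = pd.get_dummies(df, prefix_sep="-")
print(items.columns.tolist())
# e.g. ['browser-chrome', 'browser-firefox', ..., 'pages_visited-0-to-20', ...]
```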
Multi-level association rules E.g. product hierarchies Using lower levels only? Highest only? Something in-between? Possible, but items in lower levels might not have enough support Rules overly specific Srikant R. & Agrawal R., “Mining Generalized Association Rules”, In Proc. 1995 Int. Conf. Very Large Data Bases, Zurich, 1995 21
Discriminative association rules Bring in some “supervision” E.g. you’re interested in rules with outcome → {beer}, or perhaps → {beer, spirits} Interesting because multiple class labels can be combined in consequent, and consequent can also involve non-class labels to derive interesting correlations Learned patterns can be used as features for other learners, see e.g. “Keivan Kianmehr, Reda Alhajj. CARSVM: A class association rule-based classification framework and its application to gene expression data” 22
Tuning Possibility to tune “interestingness” Rare patterns: low support but still interesting E.g. people buying Rolex watches Mine by setting special group-based support thresholds (context matters) Negative patterns: E.g. the pattern of people buying a Hummer and a Tesla together will most likely not occur Negatively correlated, infrequent patterns can be more interesting than positive, frequent patterns Possibility to include domain knowledge Block frequent itemsets, e.g. if {cats, mice} occurs as a subset (no matter the support) Or allow certain itemsets, even if their support is low 23
Filtering Often lots of association rules will be discovered! Post-processing is a necessity Perform sensitivity analysis using minsup and minconf thresholds Trivial rules E.g., buy spaghetti and spaghetti sauce Unexpected / unknown rules Novel and actionable patterns, potentially interesting! Confidence might not always be the best metric (lift, conviction, rarity, …) Appropriate visualisation facilities are crucial! Association rules can be powerful but really require that you “get to work” with the results! See e.g. the arulesViz package for R: https://journal.r-project.org/archive/2017/RJ-2017-047/RJ-2017-047.pdf 24
Filtering 25
Applications Market basket analysis {baby food, diapers} => {stella} Put them closer together in the store? Put them far apart in the store? Package baby food, diapers and stella? Package baby food, diapers and stella with a poorly selling item? Raise the price on one, and lower it on the other? Do not advertise baby food, diapers and stella together? Up, down, cross-selling Basic recommender systems Customers who buy this also frequently buy… Can be a very simple though powerful approach To generate features {Insurer 64, Car Repair Shop A, Police Officer B} as frequent pattern in fraud analytics But be wary of data leakage (train/test split applies!) 26
Sequence mining In standard apriori, the order of items does not matter Instead of item-sets, think now about item-bags and item-sequences Sets: unordered, each element appears at most once: every transaction is a set of items Bags: unordered, elements can appear multiple times: every transaction is a bag of items Sequences: ordered, every transaction is a sequence of items 27
Sequence mining
Mining of frequent sequences: algorithm very similar to apriori (i.e. GSP, the Generalized Sequential Pattern algorithm)
Start with the set of frequent 1-sequences
But: expansion (candidate generation) done differently:
E.g. in normal apriori, {A, B} and {A, C} would both be expanded into the same set {A, B, C}
For sequences, suppose we have <{A}, {B}> and <{A}, {C}>, then these are now expanded (joined) into <{A}, {B}, {C}>, <{A}, {C}, {B}> and <{A}, {B, C}>
Often modified to suit particular use cases
Common case: just consider sequences with sets containing one item only, e.g. <{A}> and <{B}> expanded into <{A}, {B}> and <{B}, {A}>
E.g. in web mining or customer journey analytics
Pruning: k-sequences with infrequent (k−1)-subsequences are discarded; only continue with candidates whose support is higher than the threshold (see the sketch below) 28
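A minimal sketch of candidate generation and pruning for the simplified single-item-per-element case mentioned above; the function names and tuple-based sequence representation are illustrative assumptions, not taken from a specific library.

```python
# Simplified GSP-style candidate generation: sequences are tuples of items,
# one item per element (e.g. clickstreams).

def join_candidates(frequent_k_seqs):
    """Join frequent k-sequences into candidate (k+1)-sequences.

    s1 and s2 are joined when dropping the first item of s1 gives the same
    sequence as dropping the last item of s2; the candidate is s1 extended
    with the last item of s2.
    """
    candidates = set()
    for s1 in frequent_k_seqs:
        for s2 in frequent_k_seqs:
            if s1[1:] == s2[:-1]:
                candidates.add(s1 + (s2[-1],))
    return candidates

def prune(candidates, frequent_k_seqs):
    """Keep only candidates whose drop-one-item subsequences are all frequent."""
    frequent = set(frequent_k_seqs)
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in frequent for i in range(len(c)))}

# 1-sequences <A> and <B> yield <A,A>, <A,B>, <B,A>, <B,B> as 2-candidates
cands = join_candidates({("A",), ("B",)})
print(prune(cands, {("A",), ("B",)}))
```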
Sequence mining Extensions exist that take timing into account 29
Sequence mining Extension: frequent episode mining For very long time series Find frequent sequences within the time series Extension: continuous time series mining By first binning the continuous time series into categorical levels (similar to normal apriori) Extension: discriminative sequence mining Again: if you know the outcome of interest (i.e. sequences which lead a customer to buy a certain product) See e.g. SPMF: http://www.philippe-fournier-viger.com/spmf/ for a large collection of algorithms 30
Sequence mining “Sankey diagram” 31
Conclusion Main power of apriori comes from the fact that it can easily be extended and adapted towards specific settings Also means that you’ll probably have to go further than “out of the box” approaches Keep in mind that post-discovery filtering needs to be applied Keep in mind the possibility to make this more supervised, or to use it as a feature-generating tool Many other extensions exist (e.g. for frequent sub-graphs) 32
Clustering 33
Clustering Cluster analysis or clustering is the task of grouping a set of objects In such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters) It is a main task of exploratory data mining, and a common technique for statistical data analysis Used in many fields: machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, customer segmentation, etc. Organizing data into clusters shows internal structure of the data Find patterns, structure, etc. Sometimes the partitioning is the goal, e.g. market segmentation As a first step towards predictive techniques, e.g. mine a decision tree model on cluster labels to get further insights or even predict group outcome; use labels and distances as features 34
Clustering 35
Two types Hierarchical clustering: Create a hierarchical decomposition of the set of objects using some criterion Connectivity, distance based Agglomerative hierarchical clustering: starting with single elements and aggregating them into clusters Divisive hierarchical clustering: starting with the complete data set and dividing it into partitions Partitional clustering: Objective function based Construct various partitions and then evaluate them by some criterion K-means, k-means++, etc. 36
Hierarchical clustering Agglomerative hierarchical clustering: starting with single elements and aggregating them into clusters, bottom-up Divisive hierarchical clustering: starting with the complete data set and dividing it into partitions, top-down 37
Hierarchical clustering In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of (dis)similarity between sets of observations is required … can be quite subjective In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets 38
Hierarchical clustering
The distance metric defines the distance between two observations
Euclidean distance: ‖a − b‖₂ = √(∑_i (a_i − b_i)²)
Squared Euclidean distance: ‖a − b‖₂² = ∑_i (a_i − b_i)²
Manhattan distance: ‖a − b‖₁ = ∑_i |a_i − b_i|
Maximum distance: ‖a − b‖_∞ = max_i |a_i − b_i|
Many more are possible…
Should possess:
Symmetry: d(a, b) = d(b, a)
Constancy of self-similarity: d(a, a) = 0
Positivity: d(a, b) ≥ 0
Triangle inequality: d(a, b) ≤ d(a, c) + d(c, b) 39
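The four metrics above, computed with SciPy on two arbitrarily chosen example vectors:

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.0])

print(distance.euclidean(a, b))      # ||a - b||_2
print(distance.sqeuclidean(a, b))    # ||a - b||_2^2
print(distance.cityblock(a, b))      # Manhattan, ||a - b||_1
print(distance.chebyshev(a, b))      # maximum distance, ||a - b||_inf
```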
Hierarchical clustering
The linkage criterion defines the distance between groups of instances
Note that a group can consist of only one instance
Single linkage (minimum linkage): D(A, B) = min{ d(a, b) : a ∈ A, b ∈ B }
Leads to longer, skinnier clusters
Complete linkage (maximum linkage): D(A, B) = max{ d(a, b) : a ∈ A, b ∈ B }
Leads to tight clusters 40
Hierarchical clustering
The linkage criterion defines the distance between groups of instances
Note that a group can consist of only one instance
Average linkage (mean linkage): D(A, B) = (1 / (|A| × |B|)) ∑_{a ∈ A, b ∈ B} d(a, b)
Favorable in most cases, robust to noise
Centroid linkage: based on the distance between the centroids: D(A, B) = d(a_c, b_c)
Also robust, requires definition of a centroid concept 41
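A small sketch of how the linkage criteria turn a pairwise distance matrix into a group-to-group distance; the two point groups below are made-up examples.

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 3.0], [5.0, 3.0], [4.0, 4.0]])

D = cdist(A, B)                      # |A| x |B| matrix of pairwise distances
print("single  :", D.min())          # minimum over all pairs
print("complete:", D.max())          # maximum over all pairs
print("average :", D.mean())         # mean over all pairs
print("centroid:", np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))
```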
Hierarchical clustering A hierarchy is obtained over the instances No need to specify desired amount of clusters in advance Hierarchical structure maps nicely to human intuition for some domains Not very scalable (though fast enough in most cases) Local optima are a problem: cluster hierarchy not necessarily the best one You might want/need to normalize your features first 42
Hierarchical clustering Represent hierarchy in dendrogram Can be used to decide on number of desired clusters (a “cut” in the dendrogram) 43
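A sketch of the typical SciPy workflow: build the hierarchy, draw the dendrogram, and cut it into a chosen number of clusters. The synthetic two-blob data set and the choice of average linkage are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

Z = linkage(X, method="average", metric="euclidean")  # agglomerative hierarchy
dendrogram(Z)                                          # inspect possible cuts
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")        # "cut" into 2 clusters
print(labels)
```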
Hierarchical clustering Represent hierarchy in dendrogram Also a good indication of possible outliers or anomalies 44
Agglomerative or divisive? All implementations you’ll find will implement agglomerative clustering Divisive clustering turns out to be way more computationally expensive: don’t bother! 45
Partitional clustering
Nonhierarchical, each instance is placed in exactly one of K non-overlapping clusters
Since only one set of clusters is output, the user normally has to input the desired number of clusters k
Most well-known algorithm: k-means clustering:
1. Decide on a value for k
2. Initialize k cluster centers (e.g. randomly over the feature space or by picking some instances randomly)
3. Decide the cluster memberships of the instances by assigning them to the nearest cluster center, using some distance measure
4. Recalculate the k cluster centers
5. Repeat 3 and 4 until none of the objects changed cluster membership in the last iteration 46
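A minimal NumPy sketch of the five steps above, assuming Euclidean distance and that no cluster ends up empty; for real work an off-the-shelf implementation such as scikit-learn's KMeans is preferable.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 2: init
    labels = None
    for _ in range(max_iter):
        # step 3: assign each instance to the nearest center (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                             # step 5: converged
        labels = new_labels
        # step 4: recalculate the k cluster centers (assumes no empty cluster)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
```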
K-means example Pick random centers (either in the feature space or by using some randomly chosen instances as starts): 47
K-means example Calculate the membership of each instance: 48
K-means example And reposition the cluster centroids: 49
K-means example Reassign the instances again: 50
K-means example Recalculate the centroids: No reassignments performed, so we stop here 51
K-means
Recall that a good cluster solution has high intra-cluster (in-cluster) similarity and low inter-cluster (between-cluster) similarity
K-means optimizes for high intra-cluster similarity by optimizing towards a minimal total distortion: the sum of squared distances of points to their cluster centroid
min SSE = ∑_{k=1}^{K} ∑_{i=1}^{n_k} ‖x_{ki} − μ_k‖²
Note: an exact optimization of this objective function is hard; k-means is a heuristic approach not necessarily providing the globally optimal outcome (except in the one-dimensional case) 52
K-means Strengths Very simple to implement and debug Intuitive objective function: optimizes intra-cluster similarity Relatively efficient Weaknesses Applicable only when a mean is defined (to calculate the centroid of points) What about categorical data? Often terminates at a local optimum, hence initialization is extremely important: try multiple random starts or use k-means++ (most implementations do this by default) Need to specify k in advance: use the elbow point if unsure (e.g. plot SSE across different values for k, as in the sketch below) Sensitive to noisy data and outliers Again: normalization might be required Not suitable to discover clusters with non-convex shapes 53
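A sketch of the elbow heuristic with scikit-learn's KMeans, whose inertia_ attribute holds the SSE; the synthetic three-blob data set is an assumption for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in (0, 3, 6)])
X = StandardScaler().fit_transform(X)       # normalization is often required

ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]
plt.plot(ks, sse, marker="o")               # look for the "bend" in the curve
plt.xlabel("k"); plt.ylabel("SSE (inertia)")
plt.show()
```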
K-means and non-convex shapes 54
K-means and local optima 55
K-means and setting the value for k 56
K-means and setting the value for k Don’t be too swayed by your intuition in this case Often, it pays off to start with a higher setting for k and post-process/inspect accordingly, even if you already have a number of optimal clusters in mind 57
K-means++
An algorithm to pick good initial centroids
1. Choose the first center uniformly at random (in the feature space or from the instances)
2. For each instance x (or over a grid in the feature space), compute d(x, c): the distance between x and the nearest center c that has already been defined
3. Choose another center, using a weighted probability distribution where a point x is chosen with probability proportional to d(x, c)²
4. Repeat steps 2 and 3 until k centers have been chosen
5. Now proceed using standard k-means clustering (not repeating the initialization step, of course)
Basic idea: spread the initial clusters out from each other: try to lower inter-cluster similarity
Most k-means implementations actually implement this variant (a minimal sketch follows below) 59
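A minimal sketch of the seeding procedure over the instances; the function and variable names are illustrative assumptions.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                 # step 1: uniform pick
    for _ in range(k - 1):
        # step 2: squared distance of each point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # step 3: sample the next center with probability proportional to d(x,c)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)                            # hand these to k-means
```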
DBSCAN “Density-based spatial clustering of applications with noise” Groups together points that are closely packed together (points with many nearby neighbors) Marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away) DBSCAN is also one of the most common clustering algorithms Hierarchical versions exist as well 60
DBSCAN https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/ 61
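A short usage sketch with scikit-learn's DBSCAN on the classic two-moons toy data; the eps and min_samples values are arbitrary choices for this data set.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighbourhood radius, min_samples: density threshold;
# a label of -1 marks points flagged as noise/outliers
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))
```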
Expectation-maximization based clustering Based on Gaussian mixture models K-means is EM’ish, but makes “hard” assignments of instances to clusters Based on two steps: E-step: assign probabilistic membership: probability of membership to a cluster given an instance in dataset M-step: re-estimate parameters of model based on the probabilistic membership Which model are we estimating? A mixture of Gaussians 62
Expectation-maximization based clustering 63
Expectation-maximization based clustering Also local optimization, but nonetheless robust Learns a model for each cluster So you can generate new data points from it (though this is possible with k-means as well: just fit a gaussian for each cluster with mean on the centroid and variance derived using min sum of squared errors) Relatively efficient Extensible to other model types (e.g. multinomial models for categorical data, or noise-robust models) But: Initialization of the models still important (local optimum problem) Also still need to specify k Also problems with non-convex shapes So basically: a less “hard” version of k-means 64
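A brief scikit-learn sketch of EM-based clustering with a Gaussian mixture, showing the soft memberships and the generative use mentioned above; the synthetic two-blob data is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(3, 0.8, (100, 2))])

gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
print(gmm.predict_proba(X[:3]))     # probabilistic ("soft") memberships
print(gmm.sample(5)[0])             # generate new data points from the model
```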
Mean-shift clustering Mean shift builds upon the concept of kernel density estimation (KDE): assume the data was sampled from a probability distribution; KDE is a method to estimate that underlying distribution Works by placing a kernel on each point in the data set (kernel here: a weighting function); the most popular one is the Gaussian kernel Adding all of the individual kernels up generates a probability surface (i.e., a density function) 65
Mean-shift clustering https://spin.atomicobject.com/2015/05/26/mean-shift-clustering/ 66
Mean-shift clustering Mean shift uses the KDE idea: how would the points move if they climbed up to the nearest peak of the KDE surface: iteratively shift each point uphill until it reaches a peak Depending on the kernel bandwidth used, the KDE surface (and clustering result) will be different E.g. for tall, skinny kernels (i.e., a small kernel bandwidth), the resulting KDE surface will have a peak for each point, so each point is placed into its own cluster For extremely short, wide kernels (i.e., a large kernel bandwidth), the result is a wide, smooth KDE surface with one peak that all of the points will climb up to, resulting in one cluster 67
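A usage sketch with scikit-learn's MeanShift; the bandwidth is estimated with a heuristic and, as described above, directly drives how many clusters come out. The data set and quantile value are assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, (100, 2)), rng.normal(3, 0.4, (100, 2))])

bw = estimate_bandwidth(X, quantile=0.2)     # heuristic bandwidth estimate
labels = MeanShift(bandwidth=bw).fit_predict(X)
print(len(set(labels)), "clusters at bandwidth", round(bw, 2))
```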
Validation
How to check the result of a clustering run?
Use the k-means objective function (sum of squared errors)
Many other measures as well, e.g. (Davies and Bouldin, 1979), (Halkidi, Batistakis & Vazirgiannis, 2001)
These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters
E.g. the Davies-Bouldin Index:
DBI_N = (1/N) ∑_{i=1}^{N} R_i with R_i = max_{j≠i} R_{ij}, i = 1, ..., N and R_{ij} = (S_i + S_j) / D_{ij}
S_i: mean distance between instances in cluster i and its centroid
D_{ij}: distance between the centroids of clusters i and j
A good result has a low index score
But: given the descriptive nature, validation here can be somewhat subjective 68
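scikit-learn ships this index as davies_bouldin_score; a sketch comparing a few values of k on synthetic blobs (lower is better). The data set and range of k are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(davies_bouldin_score(X, labels), 3))   # lowest near k = 4
```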
Applications Market research: customer and market segments, product positioning Social network analysis: recognize communities of people Social science: identify students, employees with similar properties Search: group search results Recommender systems: predict preferences based on user’s cluster Crime analysis: identify areas with similar crimes Image segmentation and color palette detection (e.g. recall LIME on images) 69
Domain knowledge An interesting question for many algorithms is how domain knowledge can be incorporated in the learning step For clustering, this is often done using must-link and can’t-link constraints: who should be and should not be in the same cluster? Nice as this does not require significant changes to algorithm, but can lead to infeasible solutions if too many constraints are added Another approach is a “warm start” solution by providing a partial solution 70
More on distance metrics Most metrics we’ve seen were defined for numerical features, though these exist for textual and categorical data as well E.g. the Levenshtein distance between two text fields Based on the number of edits (changes) performed: deletions, insertions and substitutions Other metrics exist as well, e.g. Jaccard, cosine, Gower distance (https://medium.com/@rumman1988/clustering-categorical-and-numerical-datatype-using-gower-distance-ab89b3aa90d9) 71
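A small, unoptimised sketch of the Levenshtein distance via the classic dynamic-programming recurrence.

```python
def levenshtein(s, t):
    """Number of insertions, deletions and substitutions to turn s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # i deletions
    for j in range(n + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(levenshtein("kitten", "sitting"))   # 3
```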
Self-organizing maps (SOMs) Can be formalized as a special type of artificial neural networks Not really (only) clustering, but: unsupervised Produces a low-dimensional representation of the input space (just like PCA!) Also called a Kohonen map (Teuvo Kohonen) https://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/ Not that often used anymore, but this brings us to… 72
Dimensionality Reduction 73
PCA
http://setosa.io/ev/principal-component-analysis/
Principal components calculated by making use of the eigenvector decomposition of the covariance matrix of the data
PC_j = e_j′ X = e_{j1} X_1 + e_{j2} X_2 + ⋯ + e_{jp} X_p
(the eigenvalue λ_j represents the explained variance)
Pro: powerful data reduction tool and principal components are uncorrelated
Con: PCA may be difficult to interpret, linear approach 74
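A short scikit-learn sketch: components_ holds the eigenvectors e_j and the explained variance (ratio) reflects the eigenvalues λ_j. The random data is illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = StandardScaler().fit_transform(X)      # standardise before PCA

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)                  # the principal components PC_1, PC_2
print(pca.components_)                     # eigenvectors e_j (loadings)
print(pca.explained_variance_ratio_)       # share of variance per component
```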
t-SNE t-Distributed Stochastic Neighbor Embedding L.J.P. van der Maaten and G.E. Hinton, Visualizing High-Dimensional Data Using t-SNE, Journal of Machine Learning Research, 9(Nov):2579-2605, 2008 t-SNE is a dimensionality reduction technique Comparable to PCA t-SNE seeks to preserve local similarities (small pairwise distances) t-SNE is a non-linear dimensionality reduction technique based on manifold learning Assumes data points lie on embedded non-linear manifold within higher-dimensional space Manifold is topological space that locally resembles Euclidean space near each data point Example: A surface as a 2D manifold, locally resembling Euclidean plane near each data point A 3D manifold which can be described by collection of 2D manifolds Higher dimensional space can thus be well “embedded” in lower dimensional space Other manifold dimensionality reduction techniques Multi Dimensional Scaling (MDS) Isomap Locally linear embedding (LLE) Auto-encoders 75
t-SNE
Works in two steps:
1. A probability distribution representing a similarity measure over pairs of high-dimensional data points is constructed
2. A similar probability distribution over the data points in the low-dimensional map is constructed
“Similar” using the Kullback–Leibler divergence (aka information gain, relative entropy): the divergence between the two distributions is minimized
p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / ∑_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)
p_{ij} = (p_{j|i} + p_{i|j}) / (2N), with N the dimensionality of the data (note: the original t-SNE paper normalizes by the number of data points instead)
Assumption: p_{i|i} = p_{ii} = 0 76
t-SNE Step 1: measure similarities between data points in the original (high-dimensional) space The similarity of x_j to x_i is the conditional probability p_{j|i} that x_i would pick x_j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian kernel centered at x_i Then measure the density of all other data points under this Gaussian distribution and normalize 77
t-SNE
p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / ∑_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)
p_{ij} = (p_{j|i} + p_{i|j}) / (2N), with N = 2
Suppose we have 3 data points: x_i, x_j and x_k
We then compute the conditional and joint probabilities using the formulas above
Obviously a toy example, as “reducing” two dimensions doesn’t make a lot of sense 78
t-SNE
Numerator: the Gaussian distribution centered at x_i (the exponential term in the formula above)
p_{j|i} = 0.78 / z
p_{k|i} = 0.60 / z 79
t-SNE
Denominator: these similarity measures are normalized against all points except x_i itself (the sum over k ≠ i in the formula above)
p_{j|i} = 0.78 / 55.62 80
t-SNE
Finally, we can compute the joint probabilities:
p_{ij} = 0.0069 ← more similar
p_{ik} = 0.0049 81
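A NumPy sketch of these computations for three points, following the slides' convention for N; the coordinates and bandwidths are made up, so the resulting numbers will differ from the 0.78 / 55.62 / 0.0069 values above.

```python
import numpy as np

X = np.array([[0.0, 0.0],    # x_i
              [1.0, 0.5],    # x_j
              [2.0, 2.0]])   # x_k  (hypothetical coordinates)
sigma = np.ones(3)           # one bandwidth per point (here all equal)
N = X.shape[1]               # the slides use N = dimensionality = 2

sq_d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
P_cond = np.exp(-sq_d / (2 * sigma[:, None] ** 2))   # numerator of p_{j|i}
np.fill_diagonal(P_cond, 0.0)                        # p_{i|i} = 0
P_cond /= P_cond.sum(axis=1, keepdims=True)          # normalise each row

P_joint = (P_cond + P_cond.T) / (2 * N)              # symmetrised p_{ij}
print(P_joint)
```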
t-SNE
σ_i (the bandwidth of the Gaussian kernel, based on Euclidean distance) is data point dependent!
It is set based upon the perplexity, which is a measure estimating how well the distribution predicts a sample
σ_i is then set in such a way that the perplexity of the conditional distribution equals a predefined perplexity
As a result, the bandwidth parameter σ_i is adapted to the density of the data: smaller values of σ_i are used in denser parts of the data space
This is one of the key user-specified hyperparameters of t-SNE 82
t-SNE
Step 2: measure similarities q_{ij} between the mapped points y_i, y_j, … in the low-dimensional space instead of p_{ij}
A Student t-distribution is used to measure the similarities, with degrees of freedom = dimensionality of the mapped space − 1
The Student t-distribution has fatter tails than the Gaussian distribution
Assumption: q_{i|i} = q_{ii} = 0
No perplexity parameter 83
t-SNE
Next, the distances between p_{ij} and q_{ij} are considered
q_{ij} obviously depends on how we place the data points in the mapped low-dimensional space
So the locations are determined by minimizing the Kullback–Leibler divergence:
KL(P ‖ Q) = ∑_{i≠j} p_{ij} log(p_{ij} / q_{ij})
Optimized using standard gradient descent 84
t-SNE t-SNE shines when dealing with high dimensional data E.g. images or word documents 85
t-SNE t-SNE itself is not a clustering technique A 2nd-level clustering can however be easily performed on the mapped space, using for example, k-means clustering, DBSCAN or other clustering techniques Of course, you can also use the mapped coordinates as new instance features But: be careful – https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of- t-sne/264647#264647 86
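A sketch of this two-stage approach with scikit-learn (t-SNE embedding followed by k-means on the 2-D map), keeping the caveats from the linked discussion in mind; the digits data set and parameter values are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

X, _ = load_digits(return_X_y=True)

# 2-D t-SNE map of the high-dimensional data
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(X)

# second-level clustering on the mapped coordinates
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(emb)
print(emb.shape, len(set(labels)))
```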
t-SNE No theoretical reason, but most implementations only allow the lower-dimensional space to be 2D or 3D (computational costs) Also, t-SNE learns a non-parametric mapping There is no explicit function that maps data from the input space to the map Not possible to embed unseen test points in an existing map – so featurization is difficult, and t-SNE is less suitable as a dimensionality reduction technique in a predictive setup Extensions exist however that learn a multivariate regressor to predict the map location from the input data, or construct a regressor that minimizes the t-SNE loss directly (“parametric t-SNE”) Also, you can provide your own pairwise similarity matrix and do the KL-minimization instead of using the built-in conditional-probability-based similarity measure Diagonal elements should be 0 and the matrix should be normalized to sum to 1 A distance matrix can also be used: similarity = 1 − distance This avoids having to tune the perplexity parameter (but you will need to decide on a similarity metric) 87
t-SNE As t-SNE uses a gradient descent based approach, the usual remarks regarding learning rates and initialization of the mapped points apply E.g. initialization is sometimes done using PCA Defaults usually work well See the deep learning session for more on gradient descent and learning rates The most important hyperparameter is the perplexity A knob that sets the number of effective nearest neighbors (similar to k in k-NN) The perplexity value depends on the density of the data A denser dataset requires a larger perplexity Typical values range between 5 and 50 Thinking point: couldn’t the conditional probability be set using a (weighted) k-NN? 88
t-SNE Impact of perplexity: neighborhood effectively considered! 89
t-SNE Different perplexity values can lead to very different results “Size” of “clusters” has no meaning Neither does “distance” between clusters (only locally: manifold) Random data can end up looking “meaningful” More examples at https://distill.pub/2016/misread-tsne/ 90
t-SNE Matlab (original release): https://lvdmaaten.github.io/tsne/ Now built into Matlab R (tsne package): https://cran.r-project.org/web/packages/tsne/ Python (scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html Julia, Java, Torch implementations also available Parametric version: https://github.com/kylemcdonald/Parametric-t-SNE 91
UMAP Uniform Manifold Approximation and Projection for Dimension Reduction McInnes, L., Healy, J., & Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 UMAP is a dimensionality reduction technique Comparable with Principal Component Analysis (PCA) and t-SNE Like t-SNE, UMAP is a non-linear dimensionality reduction technique based on manifold learning Newer (2018) but already well-proven in bioinformatics, materials science and machine learning Performs better in mappings with a dimensionality > 3 Can incorporate supervised labels in the construction of the mapping Parametric by default: better suited as a general-purpose preprocessing technique compared to t-SNE Good performance (scales better than t-SNE, MDS) Nicer properties in terms of interpretation when compared to t-SNE 92
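A usage sketch assuming the umap-learn package; unlike t-SNE, the fitted reducer can embed unseen points via transform, and a supervised mapping can be obtained by passing labels to fit. The data set, split and hyperparameter values are assumptions.

```python
import umap                                   # pip install umap-learn
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2,
                    random_state=0)
emb_train = reducer.fit_transform(X[:1000])   # learn the mapping
emb_new = reducer.transform(X[1000:])         # embed unseen instances
# supervised variant: reducer.fit(X[:1000], y=y[:1000])
print(emb_train.shape, emb_new.shape)
```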
UMAP Like t-SNE, UMAP is a manifold learner Recall: a manifold is a topological space that locally resembles Euclidean space near each point UMAP aims to construct a locally-connected Riemannian manifold with a locally constant Riemannian metric Study of Riemannian manifolds is a (hard) research area on its own, in what follows, we present a basic intuitive overview of UMAP 93
UMAP First, we need a concept from topology: simplicial complexes A means to construct topological spaces out of simple components Allows to reduce the complexities of dealing with the continuous geometry of topological spaces to relatively simple combinatorics and counting The basic building block is a simplex: a way to build a k dimensional object A k-simplex is constructed by taking the convex hull of k + 1 independent points Every k-simplex hence describes a simple combinatorial structure A k-simplex can be regarded as a set of k + 1 objects with faces E.g. tetrahedron (3-simplex) consists of 4 triangles 94
UMAP To construct topological spaces, simplices can be combined in a “simplicial complex” K A set of simplices glued together along faces Any face of any simplex in K is also in K (i.e. every sub-simplex of simplices in K is also part of the complex) The intersection of any two simplices in K is a face of both simplices Next, we will construct a Čech complex (which is a simplicial complex) given an open cover of a topological space An open cover is some family of sets whose union is the whole space A Čech complex converts an open cover into a simplicial complex: let each set in the cover be represented by a 0-simplex (a single point). Create a 1-simplex between two such sets if they have a non-empty intersection; create a 2-simplex between three such sets if the triple intersection of all three is non-empty; and so on Topological theory provides guarantees about how well this simple process can produce something that represents the topological space itself in a meaningful way (Nerve theorem) 95
UMAP Let’s illustrate this on a toy example of a two-dimensional data set Assume samples are drawn from an underlying topological space To generate an open cover, we can simply create blobs with a fixed radius around each point as an approximation of the topological space (as we need to define intersections) 96
UMAP We can then construct a Čech complex Every data point can serve as the 0-simplex to expand from I.e. we get points, lines and triangles Similar in higher-dimensional space (but harder to plot) Note that the simplicial complex captures the topology of the dataset relatively well In fact, most of the work is being done by the 0- and 1-simplices: points and lines You could argue that this will be the same in higher-dimensional space, i.e. why bother with triangles, tetrahedrons, …? The Vietoris-Rips complex is similar to the Čech complex but is determined by the 0- and 1-simplices only: computationally easier, and it can be used instead! 97
UMAP The goal is now to construct a lower-dimensional mapping of the data that has a similar topological representation If we continue with a Vietoris-Rips complex, we basically get a graph with nodes (points) and edges (lines) We could then use any existing graph layout algorithm (e.g. based on spectral methods or force layout) to lay out the graph structure in a 2d (or higher-order) space This is a simple way to think about the basics of UMAP: construct a graph over the original data set and then reduce it using a layout algorithm to the lower-dimensional space Also see the social network session later on UMAP does it differently, however… 98
UMAP An obvious difficulty is that picking a radius for the blobs around our data points is so far arbitrary Radius too small → the resulting simplicial complex splits into many connected components Radius too large → simplicial complex turns into just a few very high dimensional simplices and fails to capture the manifold structure anymore If the data would be uniformly distributed across the underlying topology, picking a radius would be easy and stable: 99
UMAP This assumption of uniform distribution turns up frequently in manifold learning (Laplacian eigenmaps, Nerve theorem, …) But real data is not Solution: assume that the data is uniformly distributed, but the notion of distance is varying across the (original) manifold We can compute (or at least approximate) a local notion of distance for each point by making use of Riemannian geometry: a unit blob around a point stretches to the k -th nearest neighbor of the point, where k is the sample size we are using to approximate the local sense of distance Each point is given its own unique distance function, and we construct point-local blobs Similar to point dependent bandwidth around points in t-SNE with fixed perplexity! E.g. for k = 2: 100