
Problems with visualizing high dimensional data



  1. High dimensionality
     Evgeny Maksakov, CS533C, Department of Computer Science, UBC

     Today
     • Problem Overview
     • Direct Visualization Approaches
       – Dimensional anchors
       – Scagnostic SPLOMs
     • Nonlinear Dimensionality Reduction
       – Locally Linear Embedding and Isomaps
       – Charting manifold

     Problems with visualizing high dimensional data
     • Visual cluttering
     • Clarity of representation
     • Visualization is time consuming

     Classical methods

     Multiple Line Graphs
     Advantages and disadvantages:
     - Hard to distinguish dimensions if multiple line graphs are overlaid
     - Each dimension may have a different scale that should be shown
     - More than 3 dimensions can become confusing

     Scatter Plot Matrices
     Advantages and disadvantages:
     + Useful for looking at all possible two-way interactions between dimensions
     - Become inadequate for medium to high dimensionality

     Bar Charts, Histograms
     Advantages and disadvantages:
     + Good for small comparisons
     - Contain little data

     Survey Plots
     Advantages and disadvantages:
     + Allow seeing correlations between any two variables when the data is sorted according to one particular dimension
     - Can be confusing

     Parallel Coordinates
     Advantages and disadvantages:
     + Many connected dimensions are seen in limited space
     + Can see trends in data
     - Become inadequate for very high dimensionality
     - Cluttering

     Circular Parallel Coordinates
     Advantages and disadvantages:
     + Combine properties of glyphs and parallel coordinates, making pattern recognition easier
     + Compact
     - Cluttering near center
     - Harder to interpret relations between each pair of dimensions than in parallel coordinates

     Pictures from Patrick Hoffman et al. (2000)
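
The note above that each dimension may have a different scale applies to parallel coordinates as well: before records can be drawn as polylines across the axes, each dimension is usually rescaled to a common range. A minimal sketch, assuming min-max scaling to [0, 1]; the function names and the sample records are hypothetical and not taken from the slides:

```python
def min_max_scale(rows):
    """Rescale every column (dimension) of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column carries no information; place it mid-axis.
        scaled_columns.append(
            [0.5 if span == 0 else (v - lo) / span for v in col]
        )
    return [list(r) for r in zip(*scaled_columns)]


def parallel_coordinate_polylines(rows):
    """Return one polyline per record: [(axis_index, height_on_axis), ...]."""
    return [list(enumerate(r)) for r in min_max_scale(rows)]


# Three made-up records whose dimensions live on very different scales.
records = [
    [1.0, 200.0, 10000.0],
    [3.0, 400.0, 30000.0],
    [2.0, 300.0, 20000.0],
]
polylines = parallel_coordinate_polylines(records)
```

Each polyline can then be handed to any plotting routine; the clutter noted above appears as soon as many such polylines share the same axes.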

  2. Andrews’ Curves / Radviz
     Advantages and disadvantages:
     + Allows drawing a virtually unlimited number of dimensions
     + Good for data manipulation
     - Hard to interpret
     + Low cluttering
     - Cannot show quantitative data
     - High computational complexity
     Radviz employs a spring model.
     Pictures from Patrick Hoffman et al. (2000)

     Dimensional Anchors
     Attempt to Generalize Visualization Methods for High Dimensional Data

     What is a dimensional anchor?
     Nothing like that
     DA is just an axis line…
     Anchorpoints are coordinates…
     Picture from members.fortunecity.com/agreeve/seacol.htm & http://kresby.grafika.cz/data/media/46/dimension.jpg_middle.jpg

     DA Visualization Vector
     P = (p1, p2, p3, p4, p5, p6, p7, p8, p9)

     Parameters of DA
     Scatterplot features
       1. Size of the scatter plot points
       2. Length of the perpendicular lines extending from individual anchor plot points in a scatter plot
       3. Length of the lines connecting scatter plot points that are associated with the same data point
     Survey plot feature
       4. Width of the rectangle in a survey plot
     Parallel coordinates features
       5. Length of the parallel coordinate lines
       6. Blocking factor for the parallel coordinate lines
     Radviz features
       7. Size of the radviz plot point
       8. Length of “spring” lines extending from individual anchor points of radviz
       9. Zoom factor for the “spring” constant K

     DA describes visualization for any combination of:
     • Parallel coordinates
     • Scatterplot matrices
     • Radviz
     • Survey plots (histograms)
     • Circle segments

     Scatterplots
     2 DAs, P = (0.1, 1.0, 0, 0, 0, 0, 0, 0, 0)
     2 DAs, P = (0.8, 0.2, 0, 0, 0, 0, 0, 0, 0)

     Scatterplots with other layouts
     5 DAs, P = (0.5, 0, 0, 0, 0, 0, 0, 0, 0)
     3 DAs, P = (0.6, 0, 0, 0, 0, 0, 0, 0, 0)

     Survey Plots
     P = (0, 0, 0, 0.4, 0, 0, 0, 0, 0)
     P = (0, 0, 0, 1.0, 0, 0, 0, 0, 0)

     Pictures from Patrick Hoffman et al. (1999)
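
Andrews’ curves turn each record into a finite Fourier series, which is why the number of dimensions is practically unlimited. A minimal sketch, assuming the standard form f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + x5·cos(2t) + … for t in [-π, π]; the function names and the sample count are hypothetical:

```python
import math


def andrews_curve(x, t):
    """Evaluate the Andrews curve of record `x` at parameter `t`."""
    value = x[0] / math.sqrt(2)
    for i, xi in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2          # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            value += xi * math.sin(harmonic * t)   # odd position: sine term
        else:
            value += xi * math.cos(harmonic * t)   # even position: cosine term
    return value


def sample_andrews_curve(x, n_samples=101):
    """Sample the curve at evenly spaced t in [-pi, pi] for plotting."""
    step = 2 * math.pi / (n_samples - 1)
    ts = [-math.pi + k * step for k in range(n_samples)]
    return [(t, andrews_curve(x, t)) for t in ts]


curve = sample_andrews_curve([1.0, 0.5, -0.3, 2.0])
```

Every sample requires a pass over all dimensions, so the cost of drawing one record grows with both the number of dimensions and the number of samples.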

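The spring model behind Radviz can be illustrated as follows. This is a minimal sketch, assuming the common formulation in which one anchor per dimension sits on the unit circle and each record, with values already normalized to [0, 1], rests where the spring forces balance, i.e. at the value-weighted average of the anchors. The function names are hypothetical and the DA parameters p7 to p9 are not modeled:

```python
import math


def radviz_anchors(n_dims):
    """Place one dimensional anchor per dimension evenly on the unit circle."""
    return [
        (math.cos(2 * math.pi * j / n_dims), math.sin(2 * math.pi * j / n_dims))
        for j in range(n_dims)
    ]


def radviz_point(weights, anchors):
    """Equilibrium position of a record whose spring constants are `weights`."""
    total = sum(weights)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls at all: leave the point at the centre
    x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
    y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
    return (x, y)


anchors = radviz_anchors(4)
pulled_to_first = radviz_point([1.0, 0.0, 0.0, 0.0], anchors)
balanced_high = radviz_point([1.0, 1.0, 1.0, 1.0], anchors)
balanced_low = radviz_point([0.2, 0.2, 0.2, 0.2], anchors)
```

Records whose values are all equal land at the centre whatever their magnitude, which is one way to see why a Radviz plot does not convey quantitative values.
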
  3. Circular Segments
     P = (0, 0, 0, 1.0, 0, 0, 0, 0, 0)

     Parallel Coordinates
     P = (0, 0, 0, 0, 1.0, 1.0, 0, 0, 0)

     Radviz like visualization
     P = (0, 0, 0, 0, 0, 0, 0.5, 1.0, 0.5)

     Playing with parameters
     Parallel coordinates with P = (0, 0, 0, 0, 0, 0, 0.4, 0, 0.5)
     Crisscross layout with P = (0, 0, 0, 0, 0, 0, 0.4, 0, 0.5)

     More?
     Pictures from Patrick Hoffman et al. (1999)

     Scatterplot Diagnostics or Scagnostics

     Tukey’s Idea of Scagnostics
     • Take measures from scatterplot matrix
     • Construct scatterplot matrix (SPLOM) of these measures
     • Look for data trends in this SPLOM

     Scagnostic SPLOM
     Is like:
     • Visualization of a set of pointers
     Also:
     • Set of pointers to pointers also can be constructed
     Goal:
     • To be able to locate unusual clusters of measures that characterize unusual clusters of raw scatterplots

     Problems with constructing Scagnostic SPLOM
     1) Some of Tukey’s measures presume an underlying continuous empirical or theoretical probability function. This can be a problem for other types of data.
     2) The computational complexity of some of the Tukey measures is O(n^3).

     Solution*
     1. Use measures from graph theory.
        – Do not presume a connected plane of support
        – Can be metric over discrete spaces
     2. Base the measures on subsets of the Delaunay triangulation
        • Gives O(n log(n)) in the number of points
     3. Use adaptive hexagon binning before computing to further reduce the dependence on n.
     4. Remove outlying points from spanning tree
     * Leland Wilkinson et al. (2005)

     Properties of geometric graph for measures
     • Undirected (edges consist of unordered pairs)
     • Simple (no edge pairs a vertex with itself)
     • Planar (has embedding in R2 with no crossed edges)
     • Straight (embedded edges are straight line segments)
     • Finite (V and E are finite sets)

     Graphs that fit these demands:
     • Convex Hull
     • Alpha Hull
     • Minimal Spanning Tree

     Measures:
     • Length of an edge
     • Length of a graph
     • Look for a closed path (boundary of a polygon)
     • Perimeter of a polygon
     • Area of a polygon
     • Diameter of a graph

     Five interesting aspects of scattered points:
     • Outliers
       – Outlying
     • Shape
       – Convex
       – Skinny
       – Stringy
       – Straight
     • Trend
       – Monotonic
     • Density
       – Skewed
       – Clumpy
     • Coherence
       – Striated

     Classifying scatterplots
     Picture from L. Wilkinson et al. (2005)

     Looking for anomalies
     Picture from L. Wilkinson et al. (2005)
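
Two of the measures listed above, the length of an edge and the length of a graph, can be illustrated on a minimal spanning tree. This is a minimal sketch that swaps in Prim’s algorithm on the complete Euclidean graph, which is O(n^2) and simpler than the Delaunay-based O(n log(n)) construction named in the solution; the function names and the sample points are hypothetical:

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on 2-D points; returns [(i, j, length), ...]."""
    n = len(points)
    if n < 2:
        return []

    def dist(i, j):
        return math.hypot(points[i][0] - points[j][0],
                          points[i][1] - points[j][1])

    # best[j] = (shortest known distance from the tree to j, tree vertex)
    best = {j: (dist(0, j), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])   # closest vertex outside the tree
        length, i = best.pop(j)
        edges.append((i, j, length))
        for k in best:                            # relax distances through j
            d = dist(j, k)
            if d < best[k][0]:
                best[k] = (d, j)
    return edges


def graph_length(edges):
    """Length of a graph: the sum of its edge lengths."""
    return sum(length for _, _, length in edges)


# Three points in a row plus one far away; the long edge hints at an outlier.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 5.0)]
mst = minimum_spanning_tree(points)
total = graph_length(mst)
longest = max(length for _, _, length in mst)
```

An edge that is much longer than the rest of the tree is the kind of signal an “Outlying” measure would build on.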

  4. Picture from L. Wilkinson et al. (2005)

     Nonlinear Dimensionality Reduction (NLDR)
     Assumptions:
     • data of interest lies on an embedded nonlinear manifold within a higher dimensional space
     • manifold is low dimensional, so it can be visualized in low dimensional space

     Manifold
     Topological space that is “locally Euclidean”.
     Picture from: http://en.wikipedia.org/wiki/Image:KleinBottle-01.png
     Picture from: http://en.wikipedia.org/wiki/Image:Triangle_on_globe.jpg

     Methods
     • Locally Linear Embedding
     • ISOMAPS

     Isomaps Algorithm
     1. Construct neighborhood graph
     2. Compute shortest paths
     3. Construct d-dimensional embedding (like in MDS)
     Picture from: Joshua B. Tenenbaum et al. (2000)
     Pictures taken from http://www.cs.wustl.edu/~pless/isomapImages.html

     Locally Linear Embedding (LLE) Algorithm
     Picture from Lawrence K. Saul et al. (2002)

     Application of LLE
     Original Sample, Mapping by LLE
     Picture from Lawrence K. Saul et al. (2002)

     Limitations of LLE
     • Algorithm can only recover embeddings whose dimensionality, d, is strictly less than the number of neighbors, K. A margin between d and K is recommended.
     • Algorithm is based on the assumption that a data point and its nearest neighbors can be modeled as locally linear; for curved manifolds, too large a K will violate this assumption.
     • If the data is originally low dimensional, the algorithm degenerates.

     Proposed improvements*
     • Analyze pairwise distances between data points instead of assuming that data is a multidimensional vector
     • Reconstruct convex
     • Estimate the intrinsic dimensionality
     • Enforce the intrinsic dimensionality if it is known a priori or highly suspected
     * Lawrence K. Saul et al. (2002)

     Strengths and weaknesses:
     • ISOMAP handles holes well
     • ISOMAP can fail if data hull is non-convex
     • Vice versa for LLE
     • Both offer embeddings without mappings.

     Charting manifold

     Where ISOMAPs and LLE fail, Charting Prevail
     Picture from Matthew Brand (2003)

     Algorithm Idea
     1) Find a set of data covering locally linear neighborhoods (“charts”) such that adjoining neighborhoods span maximally similar subspaces
     2) Compute a minimal-distortion merger (“connection”) of all charts
     Pictures from Matthew Brand (2003)

     Video test
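
The first two steps of the Isomaps algorithm listed above can be sketched in a few lines. This is an illustrative sketch only, assuming a symmetric K-nearest-neighbor graph and Floyd-Warshall shortest paths; step 3, the MDS embedding, is left out because it needs an eigendecomposition. The function names and the semicircle test data are hypothetical:

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_graph(points, k):
    """Step 1: symmetric K-nearest-neighbor graph as a distance matrix."""
    n = len(points)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        nearest = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in nearest[:k]:
            graph[i][j] = d   # keep the edge in both directions
            graph[j][i] = d
    return graph


def geodesic_distances(graph):
    """Step 2: all-pairs shortest paths (Floyd-Warshall)."""
    n = len(graph)
    dist = [row[:] for row in graph]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = dist[i][m] + dist[m][j]
                if via < dist[i][j]:
                    dist[i][j] = via
    return dist


# Nine points on a unit semicircle: a 1-D manifold embedded in 2-D.
points = [(math.cos(math.pi * i / 8), math.sin(math.pi * i / 8)) for i in range(9)]
geo = geodesic_distances(knn_graph(points, k=2))
straight = euclidean(points[0], points[8])   # chord through the ambient space
along_manifold = geo[0][8]                   # path that follows the arc
```

For the two ends of the semicircle, the graph distance stays close to the arc length while the straight-line distance cuts across the ambient space; that difference is what the embedding in step 3 is meant to preserve.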
