Courtesy of Prof. Shixia Liu @Tsinghua University
Outline • Introduction • Classification of Techniques – Table – Scatter Plot Matrices – Projections – Parallel Coordinates • Summary
Motivation • Real world data contain multiple dimensions
Multivariate/Multidimensional Data Visualization • Multivariate data visualization is a specific type of information visualization that deals with multivariate/multidimensional data • The data to be visualized are of high dimensionality, and the correlations between the many attributes are of interest
Dimensionality • Refers to the number of attributes present in the data – 1: one-dimensional 1D / univariate – 2: two-dimensional 2D / bivariate – 3: three-dimensional 3D / trivariate – >3: multidimensional / hypervariate / multivariate • The boundary between high and low dimensionality is not clear-cut; generally, high dimensionality means >4 variables
Terminology
• Multidimensional: refers to the dimensionality of the independent variables
• Multivariate: refers to the dimensionality of the dependent variables
Outline • Introduction • Classification of Techniques – Projections – Parallel Coordinates – Table – Scatter Plot Matrices • Summary
Classification of Techniques • Projection • Parallel Coordinates Plot • Table • Scatter Plot Matrix
Projections
• What if we have too many dimensions?
• An intuitive approach is to project the data into a low-dimensional space
• Linear projections
• Nonlinear projections
A projection (X -> Y) maps points {x_1, x_2, …, x_m} in an n-dimensional space to points {y_1, y_2, …, y_m} in a p-dimensional space (p << n) while preserving distance measures between data items.
Classification • Linear projection – Example: PCA (principal component analysis) • Non-linear projection – Example: t-SNE (t-distributed stochastic neighbor embedding)
PCA • Seeks a space of lower dimensionality (magenta) • Such that the orthogonal projection of the data points (red) onto this subspace maximizes the variance of the projected points (green)
Maximizes Variance
• To begin with, consider the projection onto a one-dimensional space
• The direction of this space is a unit vector u_1 (with u_1^T u_1 = 1), and each data point x_n projects to the scalar u_1^T x_n
• Variance of the projected data: u_1^T S u_1, where S is the data covariance matrix
• How to maximize this? Trick: enforce the unit-length constraint with a Lagrange multiplier, maximizing u_1^T S u_1 + λ_1 (1 − u_1^T u_1)
Maximizes Variance (cont’d)
• Setting the derivative with respect to u_1 to zero gives the eigenvalue equation S u_1 = λ_1 u_1
• The variance u_1^T S u_1 = λ_1 is therefore maximized by choosing u_1 as the eigenvector of S with the largest eigenvalue λ_1 (the first principal component)
One Example
Extension to M Dimensions
• Define additional principal components in an incremental fashion (for details, see Chapter 12 of Pattern Recognition and Machine Learning)
• Conclusion for M dimensions:
• The optimal projection is given by the M eigenvectors u_1, ..., u_M of the data covariance matrix S corresponding to the M largest eigenvalues λ_1, ..., λ_M
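The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the slides' own code; the function name `pca` and the toy data are assumptions for the example:

```python
import numpy as np

def pca(X, M):
    """Project data X (samples x n) onto its M principal components
    via eigendecomposition of the data covariance matrix S."""
    Xc = X - X.mean(axis=0)               # center the data
    S = np.cov(Xc, rowvar=False)          # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]     # re-sort descending
    U = eigvecs[:, order[:M]]             # top-M eigenvectors u_1..u_M
    return Xc @ U, eigvals[order[:M]]

# Toy 2-D data stretched along the first axis: the leading eigenvalue
# should capture the large variance of that direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
Y, lams = pca(X, 1)
```

Note that `np.linalg.eigh` is used because S is symmetric; production code often uses an SVD of the centered data instead for numerical stability.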
Covariance Matrix
• S = (1/N) Σ_{n=1}^{N} (x_n − x̄)(x_n − x̄)^T, where x̄ is the sample mean of the data points
Fit an n-d Ellipsoid to the Data
t-SNE
t-SNE • Particularly well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot
Major Goal • t-Distributed stochastic neighbor embedding (t-SNE) minimizes the divergence between two distributions: a distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise similarities of the corresponding low-dimensional points in the embedding.
Two Main Stages • First, t-SNE constructs a probability distribution over pairs of high-dimensional objects – Similar objects have a high probability of being picked – Dissimilar points have an extremely small probability of being picked
Example – Step 1
Two Main Stages (cont’d) • Second, t-SNE defines a probability distribution over the points in the low-dimensional map – Analogous to the one in the high-dimensional space, but using a heavy-tailed Student's t-distribution – Minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map
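These two stages can be sketched with NumPy. For simplicity the sketch assumes a single fixed Gaussian bandwidth sigma (real t-SNE tunes a per-point sigma_i via a perplexity parameter); all function names are illustrative:

```python
import numpy as np

def pairwise_sq_dists(X):
    sq = np.sum(X**2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * X @ X.T

def high_dim_p(X, sigma=1.0):
    """Stage 1: Gaussian similarities in the original space."""
    D = pairwise_sq_dists(X)
    P = np.exp(-D / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)              # no self-similarity
    P = P / P.sum(axis=1, keepdims=True)  # conditional p_{j|i}
    return (P + P.T) / (2 * len(X))       # symmetrized joint p_{ij}

def low_dim_q(Y):
    """Stage 2: heavy-tailed Student-t (1 d.o.f.) similarities in the map."""
    D = pairwise_sq_dists(Y)
    Q = 1.0 / (1.0 + D)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """The objective t-SNE minimizes over the map positions."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))   # 50 points in 10-D
Y = rng.normal(size=(50, 2))    # random initial 2-D map
P, Q = high_dim_p(X), low_dim_q(Y)
```

Both P and Q sum to one over all pairs, so the KL divergence between them is well-defined and non-negative.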
Example: Step Two
Example: Step Two (before optimization)
Example: Final Result (Gaussian distribution vs. Student t-distribution)
The Student's t-distribution • The volume of an N-dimensional ball of radius r scales as r^N • When N is large, if we pick random points uniformly in the ball, most points will be close to the surface, and very few will be near the center.
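A quick simulation illustrates this concentration effect. Sampling uniformly in the unit N-ball uses the volume scaling V(r) ~ r^N; the helper below is a sketch for illustration, not from the slides:

```python
import numpy as np

def uniform_ball(n_points, N, rng):
    """Uniform samples in the unit N-ball: a random direction scaled by
    U**(1/N), which follows from the volume scaling V(r) ~ r^N."""
    g = rng.normal(size=(n_points, N))
    dirs = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.uniform(size=(n_points, 1)) ** (1.0 / N)
    return dirs * radii

rng = np.random.default_rng(0)
frac_near_surface = {}
for N in (2, 10, 100):
    pts = uniform_ball(10_000, N, rng)
    # fraction of points in the thin shell between radius 0.9 and 1;
    # analytically this is 1 - 0.9**N
    frac_near_surface[N] = float(np.mean(np.linalg.norm(pts, axis=1) > 0.9))
```

For N = 2 only about 19% of the points lie in the outer shell, while for N = 100 essentially all of them do.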
The Student's t-distribution • If the same Gaussian distribution is used for the low-dimensional map points, there is not enough space available in the low-dimensional space – The crowding problem • Use a Student's t-distribution with one degree of freedom (the Cauchy distribution) for the map points instead – It has a much heavier tail than the Gaussian distribution, which compensates for the original imbalance.
Comparison
The Distribution Model
• Probability model for the high-dimensional data points: p_{j|i} = exp(−||x_i − x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(−||x_i − x_k||^2 / 2σ_i^2), symmetrized as p_{ij} = (p_{j|i} + p_{i|j}) / (2m)
• Probability model for the low-dimensional map points: q_{ij} = (1 + ||y_i − y_j||^2)^{−1} / Σ_{k≠l} (1 + ||y_k − y_l||^2)^{−1}
• The difference between the two distributions: C = KL(P||Q) = Σ_{i≠j} p_{ij} log(p_{ij} / q_{ij})
The Solution
• To minimize this score, we perform gradient descent. The gradient can be computed analytically: ∂C/∂y_i = 4 Σ_j (p_{ij} − q_{ij})(y_i − y_j)(1 + ||y_i − y_j||^2)^{−1}
• Update y_i iteratively along the negative gradient
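A minimal sketch of this update rule. The target distribution P here is a random toy matrix (in practice P comes from the Gaussian similarities of the data), and the learning rate and iteration count are illustrative assumptions:

```python
import numpy as np

def tsne_gradient(P, Y):
    """Analytic gradient: dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j)
    / (1 + ||y_i - y_j||^2), with Q the Student-t map similarities."""
    sq = np.sum(Y**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * Y @ Y.T
    inv = 1.0 / (1.0 + D)
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()
    PQ = (P - Q) * inv
    # sum_j PQ_ij (y_i - y_j) = rowsum(PQ) * y_i - PQ @ Y
    grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
    return grad, Q

def kl(P, Q, eps=1e-12):
    m = P > 0
    return float(np.sum(P[m] * np.log(P[m] / (Q[m] + eps))))

rng = np.random.default_rng(1)
n = 20
P = rng.uniform(size=(n, n))
P = P + P.T
np.fill_diagonal(P, 0.0)
P /= P.sum()                             # symmetric joint target
Y = rng.normal(scale=1e-2, size=(n, 2))  # small random initial map
_, Q0 = tsne_gradient(P, Y)
before = kl(P, Q0)
for _ in range(500):
    grad, Q = tsne_gradient(P, Y)
    Y -= 1.0 * grad                      # plain gradient descent step
after = kl(P, Q)
```

Real implementations add momentum and "early exaggeration" on top of this plain step; either way the KL divergence should fall as the map points move.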
One Example
Example: MNIST • Handwritten digits (0–9)
Package • Laurens van der Maaten, https://lvdmaaten.github.io/tsne/
– L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15(Oct):3221-3245, 2014.
– L.J.P. van der Maaten and G.E. Hinton. Visualizing Non-Metric Similarities in Multiple Maps. Machine Learning 87(1):33-55, 2012.
– L.J.P. van der Maaten. Learning a Parametric Embedding by Preserving Local Structure. In Proceedings of the Twelfth International Conference on Artificial Intelligence & Statistics (AISTATS), JMLR W&CP 5:384-391, 2009.
– L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
Comparison
• PCA, MDS
– Linear techniques
– Keep the low-dimensional representations of dissimilar data points far apart
• t-SNE
– Non-linear technique
– Captures much of the local structure of the high-dimensional data very well, while also revealing global structure such as the presence of clusters at multiple scales
Comparison
Parallel Coordinates
• Inselberg, "Multidimensional Detective" (parallel coordinates), 1997
Parallel Coordinates: Visual Design
• Dimensions as parallel axes (dim1, dim2, dim3, …, dimn), each scaled from its minimum (0) to its maximum (1)
• Data items as line segments
• The intersections on the axes indicate the values of the corresponding attributes
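The axis-intersection heights described above are simply per-attribute min-max normalized values; a small sketch (the function name is illustrative):

```python
import numpy as np

def parallel_coords_positions(X):
    """Min-max normalize each attribute so that a data item becomes the
    sequence of heights where its polyline crosses each parallel axis."""
    mins = X.min(axis=0)
    span = X.max(axis=0) - mins
    span[span == 0] = 1.0                # avoid divide-by-zero on flat axes
    return (X - mins) / span

# Three items over three attributes with very different ranges.
X = np.array([[1.0, 200.0, 3.0],
              [2.0, 100.0, 9.0],
              [3.0, 150.0, 6.0]])
H = parallel_coords_positions(X)         # each column now spans [0, 1]
```

Row i of H gives the heights of item i's polyline on axes dim1..dim3; drawing the lines is then a matter of connecting consecutive columns.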
Parallel Coordinates: Pros and Cons
+ Correlations among attributes can be studied by spotting the locations of the intersection points
+ Effective for revealing data distributions and functional dependencies
− Visual clutter due to the limited space available for each parallel axis
− Axes are packed very closely when dimensionality is high
• Clustering and filtering approaches • Dimension reordering approaches • Visual enhancement approaches Out5d dataset (5 dimensions, 16384 data items)
Star Coordinates
• Scatterplots for higher dimensions: each attribute is an axis on a circle, each data item a point
• Changing the length of an axis alters the contribution of that attribute
• Changing the direction of an axis (so the angles are no longer equal) adjusts the correlations between attributes
+ Useful for gaining insight into hierarchically clustered datasets and for multi-factor analysis in decision-making
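The star-coordinates mapping itself is a short computation: place one unit axis vector per attribute evenly around a circle and sum each item's normalized attribute values along them. A sketch, with illustrative names:

```python
import numpy as np

def star_coordinates(X):
    """Map each n-attribute item to 2-D: the sum of its (min-max
    normalized) attribute values along n evenly spaced axis vectors."""
    n = X.shape[1]
    angles = 2 * np.pi * np.arange(n) / n
    axes = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # n x 2
    mins = X.min(axis=0)
    span = X.max(axis=0) - mins
    span[span == 0] = 1.0
    Xn = (X - mins) / span               # normalize attributes to [0, 1]
    return Xn @ axes                     # m x 2 point positions

X = np.random.default_rng(0).uniform(size=(100, 5))
pos = star_coordinates(X)
```

Interactively scaling a row of `axes` changes that attribute's contribution, and rotating it changes the apparent correlations, exactly as the bullets above describe.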
Table Lens
• Represents rows as data items and columns as attributes
• Each column can be viewed as a histogram or plot
• Information along rows or columns is interrelated
+ Uses the familiar concept of a "table"
Rao and Card, "The table lens: merging graphical and symbolic representations in an interactive focus+context visualization for tabular information", 1994
Scatterplot Matrix
• Scatterplot: 2 attributes projected along the x- and y-axes
• The collection of scatterplots is organized in a matrix
+ Straightforward
− Important patterns in higher dimensions are barely recognizable
− Chaotic when the number of data items is too large
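Enumerating the cells of a scatterplot matrix is just a pairwise combination of attributes; a sketch of the bookkeeping (the plotting itself is omitted, and the function name is illustrative):

```python
from itertools import combinations
import numpy as np

def splom_panels(X, names):
    """Enumerate the distinct off-diagonal cells of a scatterplot matrix:
    each cell pairs two attributes as (x-axis data, y-axis data)."""
    panels = {}
    for i, j in combinations(range(X.shape[1]), 2):
        panels[(names[i], names[j])] = (X[:, i], X[:, j])
    return panels

X = np.random.default_rng(0).normal(size=(30, 4))
panels = splom_panels(X, ["a", "b", "c", "d"])  # 4 choose 2 = 6 panels
```

The n(n-1)/2 growth of this dictionary is exactly why the matrix becomes unwieldy as dimensionality rises.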
Outline • Introduction • Classification of Techniques – Table – Scatter Plot Matrices – Projections – Parallel Coordinates – Pixel-Oriented Techniques – Iconography • Summary
Summary of Visualizations
• Projection: + clear visual patterns; − obscured semantics, loss of information, visual clutter
• Parallel Coordinates: + clear visual patterns; − visual clutter
• Table: + uses the familiar concept "table"; − supports only a limited number of dimensions
• Scatterplot Matrix: + simple; − visual clutter, unclear patterns
Further Reading • Survey – Dos Santos, Selan, and Ken Brodlie. "Gaining understanding of multivariate and multidimensional data through visualization." Computers & Graphics 28.3 (2004): 311-325. • Website – http://www.sci.utah.edu/~shusenl/highDimSurvey/website/
Further Reading • Evaluation – Rubio-Sánchez, Manuel, et al. "A comparative study between RadViz and Star Coordinates." IEEE transactions on visualization and computer graphics 22.1 (2016): 619-628.