Dimensionality Reduction
Lecture 23
David Sontag, New York University
Slides adapted from Carlos Guestrin and Luke Zettlemoyer
Dimensionality reduction
• Input data may have thousands or millions of dimensions!
  – e.g., text data has ???, images have ???
• Dimensionality reduction: represent data with fewer dimensions
  – easier learning – fewer parameters
  – visualization – show high-dimensional data in 2D
  – discover the "intrinsic dimensionality" of the data
    • high-dimensional data that is truly lower dimensional
  – noise reduction
Dimension reduction
• Assumption: data (approximately) lies on a lower dimensional space
• Examples: n = 2, k = 1; n = 3, k = 2
Slide from Yi Zhang
Example (from Bishop)
• Suppose we have a dataset of digits ("3") perturbed in various ways:
• What operations did I perform? What is the data's intrinsic dimensionality?
• Here the underlying manifold is nonlinear
Lower dimensional projections
• Obtain a new feature vector by transforming the original features x_1 … x_n:
    z_1 = w_0^(1) + Σ_i w_i^(1) x_i
    …
    z_k = w_0^(k) + Σ_i w_i^(k) x_i
  (In general the map will not be invertible – cannot go from z back to x)
• New features are linear combinations of the old ones
• Reduces dimension when k < n
• This is typically done in an unsupervised setting – just X, but no Y
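To make the linear map concrete, here is a minimal NumPy sketch (not from the slides; the data, the weight matrix W, and the intercepts w0 are all made up for illustration). Each new feature z_j is a fixed linear combination of the original features, and because k < n the map cannot be inverted.

```python
import numpy as np

# Hypothetical data: m = 4 points with n = 3 original features.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 1.0],
              [0.5, 1.5, 2.5],
              [3.0, 3.0, 0.0]])

# Made-up weights: one row w^(j) per new feature (k = 2), plus intercepts w_0^(j).
W = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.3, 0.7]])
w0 = np.array([0.1, -0.2])

Z = X @ W.T + w0     # z_j = w_0^(j) + sum_i w_i^(j) x_i, for every data point
print(Z.shape)       # (4, 2): dimension reduced from n = 3 to k = 2
# The map is many-to-one for k < n, so x cannot be recovered from z alone.
```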
Which projection is better?
From notes by Andrew Ng
Reminder: Vector projections
• Basic definitions:
  – A · B = |A||B| cos θ
• Assume |B| = 1 (unit vector)
  – A · B = |A| cos θ
  – So, the dot product is the length of the projection!
Using a new basis for the data
• Project a point into a (lower dimensional) space:
  – point: x = (x_1, …, x_n)
  – select a basis – a set of unit (length 1) basis vectors (u_1, …, u_k)
    • we consider an orthonormal basis: u_j · u_j = 1, and u_j · u_l = 0 for j ≠ l
  – select a center – x̄, which defines the offset of the space
  – the best coordinates in the lower dimensional space are given by dot products: (z_1, …, z_k), with z_j^i = (x^i − x̄) · u_j
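A small NumPy sketch of this projection step, using made-up data and an arbitrary orthonormal basis obtained from a QR factorization (PCA will give a principled choice of basis on the following slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy data: m = 100 points, n = 5 features

x_bar = X.mean(axis=0)                 # center of the new coordinate system

# Build k = 2 orthonormal basis vectors u_1, u_2 (here just from a QR
# factorization of a random matrix; any orthonormal basis works for the demo).
U, _ = np.linalg.qr(rng.normal(size=(5, 2)))

# Check orthonormality: U^T U should be the 2x2 identity.
assert np.allclose(U.T @ U, np.eye(2))

# Coordinates in the lower dimensional space: z_j = (x_i - x_bar) . u_j
Z = (X - x_bar) @ U
print(Z.shape)                         # (100, 2)
```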
Maximize variance of projection
Let x^(i) be the i-th data point minus the mean. Choose a unit-length u to maximize

  (1/m) Σ_{i=1}^m (x^(i) · u)² = (1/m) Σ_{i=1}^m u^T x^(i) x^(i)^T u = u^T [ (1/m) Σ_{i=1}^m x^(i) x^(i)^T ] u = u^T Σ u,

where the bracketed term is the covariance matrix Σ. With the constraint ||u|| = 1, the method of Lagrange multipliers shows that the solution is the principal eigenvector of the covariance matrix! (shown on board)
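As a quick numerical sanity check of this claim (toy data, not part of the lecture): the projected variance of the principal eigenvector equals the top eigenvalue of the covariance matrix, and random unit vectors never beat it.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic toy data
Xc = X - X.mean(axis=0)
m = Xc.shape[0]

Sigma = Xc.T @ Xc / m                      # covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
u_star = eigvecs[:, -1]                    # principal eigenvector

def proj_var(u):
    """Projected variance (1/m) sum_i (x^(i) . u)^2 for a unit vector u."""
    u = u / np.linalg.norm(u)
    return np.mean((Xc @ u) ** 2)

# The principal eigenvector attains the top eigenvalue, and no random
# direction does better (Rayleigh quotient bound).
print(proj_var(u_star), eigvals[-1])
assert all(proj_var(rng.normal(size=3)) <= proj_var(u_star) + 1e-9
           for _ in range(200))
```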
Basic PCA algorithm [Pearson 1901; Hotelling 1933]
• Start from the m × n data matrix X
• Recenter: subtract the mean from each row of X
  – X_c ← X − X̄
• Compute the covariance matrix:
  – Σ ← (1/m) X_c^T X_c
• Find the eigenvectors and eigenvalues of Σ
• Principal components: the k eigenvectors with the highest eigenvalues
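A minimal NumPy sketch of exactly these steps, assuming the data is small enough that forming the n × n covariance matrix is acceptable (the `pca` helper name and the test data are mine, not the lecture's):

```python
import numpy as np

def pca(X, k):
    """Basic PCA as on the slide: recenter, covariance, top-k eigenvectors."""
    m, n = X.shape
    X_bar = X.mean(axis=0)
    Xc = X - X_bar                             # recenter: subtract mean from each row
    Sigma = Xc.T @ Xc / m                      # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # symmetric eig, ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    U = eigvecs[:, order]                      # principal components (one per column)
    Z = Xc @ U                                 # coordinates in the new basis
    return Z, U, eigvals[order], X_bar

# Usage on made-up data:
X = np.random.default_rng(2).normal(size=(200, 5))
Z, U, top_vals, X_bar = pca(X, k=2)
print(Z.shape, U.shape)                        # (200, 2) (5, 2)
```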
PCA example (figure): original data, its projection onto the principal component, and the reconstruction in the original space.
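The figure's three panels can be reproduced in miniature: project each centered point onto the first principal component (one number per point), then map it back into the original space. This is a sketch on made-up 2-D data, not the slide's dataset.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy 2-D data lying near a line.
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t]) + 0.05 * rng.normal(size=(200, 2))

x_bar = X.mean(axis=0)
Xc = X - x_bar
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(X))
u1 = eigvecs[:, -1]                    # first principal component

z = Xc @ u1                            # projection: one coordinate per point
X_hat = x_bar + np.outer(z, u1)        # reconstruction back in the original space

print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))  # small reconstruction error
```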
Dimensionality reduction with PCA
• In high-dimensional problems, data usually lies near a linear subspace, since noise introduces only small variability
• Only keep the data projections onto principal components with large eigenvalues; the components of lesser significance can be ignored
• Variance captured by coordinate z_j:
    var(z_j) = (1/m) Σ_{i=1}^m (z_j^i)² = (1/m) Σ_{i=1}^m (x^i · u_j)² = λ_j
• Percentage of total variance captured by dimension z_j: λ_j / Σ_{l=1}^n λ_l
  (figure: variance (%) captured by PC1 … PC10)
• You might lose some information, but if the eigenvalues are small, you don't lose much
Slide from Aarti Singh
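A sketch of how the PC1…PC10 bar chart is computed, i.e., the fraction λ_j / Σ_l λ_l for each component (the data and the `explained_variance` helper are made up for illustration):

```python
import numpy as np

def explained_variance(X):
    """Fraction of total variance captured by each principal direction."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(X))[::-1]  # descending lambda_j
    return eigvals / eigvals.sum()

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 10)) * np.arange(10, 0, -1)  # made-up decaying scales
ratios = explained_variance(X)
print(ratios.round(3))             # analogue of the PC1..PC10 bar chart
print(np.cumsum(ratios).round(3))  # how many components keep most of the variance
```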
Eigenfaces [Turk, Pentland ’91]
• Input images:
• Principal components:
Eigenfaces reconstruction
• Each image corresponds to adding together (weighted versions of) the principal components:
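A sketch of that reconstruction, mean face plus a weighted sum of eigenfaces, using random vectors as stand-ins for face images (no real image data here):

```python
import numpy as np

rng = np.random.default_rng(5)
# Stand-in for face images: m = 50 "images" of 8x8 = 64 pixels, flattened to rows.
X = rng.normal(size=(50, 64))

mean_face = X.mean(axis=0)
Xc = X - mean_face
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(X))
U = eigvecs[:, ::-1][:, :10]           # top-10 "eigenfaces", one per column

z = Xc[0] @ U                          # weights of image 0 on each eigenface
reconstruction = mean_face + U @ z     # mean face + weighted sum of eigenfaces
print(np.linalg.norm(X[0] - reconstruction))  # error shrinks as more components are kept
```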
Scaling up
• The covariance matrix can be really big!
  – Σ is n × n
  – 10,000 features can be common!
  – finding eigenvectors is very slow…
• Use the singular value decomposition (SVD)
  – finds the k eigenvectors for us
  – great implementations available, e.g., Matlab svd
SVD
• Write X = Z S U^T
  – X ← data matrix, one row per datapoint
  – S ← singular value matrix: diagonal, with entries σ_i
    • relationship between the singular values of X and the eigenvalues of Σ: λ_i = σ_i²/m
  – Z ← weight matrix, one row per datapoint
    • Z times S gives the coordinates of x_i in eigenspace
  – U^T ← singular vector matrix
    • in our setting, each row is an eigenvector u_j
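The X_c = Z S U^T decomposition and the λ_i = σ_i²/m relationship can be checked numerically with NumPy's SVD (toy data; `np.linalg.svd` returns the three factors as Z, the vector of σ_i, and U^T):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))
Xc = X - X.mean(axis=0)
m = Xc.shape[0]

# numpy returns X_c = Z * diag(s) * Vt; the rows of Vt are the eigenvectors u_j.
Z, s, Vt = np.linalg.svd(Xc, full_matrices=False)

eigvals = np.linalg.eigvalsh(Xc.T @ Xc / m)[::-1]   # eigenvalues of the covariance
print(np.allclose(eigvals, s ** 2 / m))             # lambda_i = sigma_i^2 / m

coords = Z * s                                      # Z times S: coordinates in eigenspace
print(np.allclose(coords, Xc @ Vt.T))               # same as projecting onto the u_j
```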
PCA using SVD: algorithm
• Start from the m × n data matrix X
• Recenter: subtract the mean from each row of X
  – X_c ← X − X̄
• Call an SVD routine on X_c
  – ask for the k leading singular vectors
• Principal components: the k singular vectors with the highest singular values (rows of U^T)
  – Coefficients: project each point onto the new vectors
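A minimal sketch of this SVD-based variant (function name and test data are mine): it avoids ever forming the n × n covariance matrix.

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD, as on the slide: no n x n covariance matrix is formed."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    Z, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # singular values come sorted
    U = Vt[:k]                      # top-k principal components (rows of U^T)
    coeffs = Xc @ U.T               # project each point onto the new vectors
    return coeffs, U, x_bar

# Usage on made-up data:
X = np.random.default_rng(7).normal(size=(100, 20))
coeffs, U, x_bar = pca_svd(X, k=3)
print(coeffs.shape, U.shape)        # (100, 3) (3, 20)
```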
Non-linear methods
• Linear: Principal Component Analysis (PCA), Factor Analysis, Independent Component Analysis (ICA)
• Nonlinear: Laplacian Eigenmaps, ISOMAP, Local Linear Embedding (LLE)
Slide from Aarti Singh
Isomap
• Goal: use the geodesic distance between points (with respect to the manifold)
• Estimate the manifold using a graph; the distance between points is given by the shortest path distance in the graph
• Embed onto the 2D plane so that Euclidean distance approximates the graph distance
[Tenenbaum, Silva, Langford. Science 2000]
Isomap
Table 1. The Isomap algorithm takes as input the distances d_X(i, j) between all pairs i, j from N data points in the high-dimensional input space X, measured either in the standard Euclidean metric (as in Fig. 1A) or in some domain-specific metric (as in Fig. 1B). The algorithm outputs coordinate vectors y_i in a d-dimensional Euclidean space Y that (according to Eq. 1) best represent the intrinsic geometry of the data. The only free parameter (ε or K) appears in Step 1.
Step 1 – Construct neighborhood graph: Define the graph G over all data points by connecting points i and j if [as measured by d_X(i, j)] they are closer than ε (ε-Isomap), or if i is one of the K nearest neighbors of j (K-Isomap). Set edge lengths equal to d_X(i, j).
Step 2 – Compute shortest paths: Initialize d_G(i, j) = d_X(i, j) if i, j are linked by an edge; d_G(i, j) = ∞ otherwise. Then for each value of k = 1, 2, …, N in turn, replace all entries d_G(i, j) by min{d_G(i, j), d_G(i, k) + d_G(k, j)}. The matrix of final values D_G = {d_G(i, j)} will contain the shortest path distances between all pairs of points in G (16, 19).
Step 3 – Construct d-dimensional embedding: Let λ_p be the p-th eigenvalue (in decreasing order) of the matrix τ(D_G) (17), and v_p^i be the i-th component of the p-th eigenvector. Then set the p-th component of the d-dimensional coordinate vector y_i equal to √λ_p · v_p^i.
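The three steps above translate almost line by line into NumPy. The sketch below is an illustrative K-Isomap implementation, not the authors' reference code: Step 2 uses the Floyd–Warshall update exactly as written in the table, and Step 3 assumes the standard classical-MDS choice τ(D) = −H D² H / 2 with H the centering matrix (the table only cites the definition of τ). The example data are made up and assumed dense enough for the neighborhood graph to be connected.

```python
import numpy as np

def isomap(X, n_neighbors=5, d=2):
    """Sketch of the three Isomap steps (K-Isomap variant)."""
    N = X.shape[0]
    # Step 1: neighborhood graph with Euclidean edge lengths d_X(i, j).
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dG = np.full((N, N), np.inf)
    np.fill_diagonal(dG, 0.0)
    for i in range(N):
        nbrs = np.argsort(dX[i])[1:n_neighbors + 1]   # K nearest neighbors of i
        dG[i, nbrs] = dX[i, nbrs]
        dG[nbrs, i] = dX[i, nbrs]                     # keep the graph symmetric
    # Step 2: shortest paths (Floyd-Warshall, the update written in the table).
    for k in range(N):
        dG = np.minimum(dG, dG[:, k:k + 1] + dG[k:k + 1, :])
    # Step 3: classical MDS on tau(D_G) = -H (D_G^2) H / 2.
    H = np.eye(N) - np.ones((N, N)) / N
    tau = -H @ (dG ** 2) @ H / 2.0
    eigvals, eigvecs = np.linalg.eigh(tau)
    idx = np.argsort(eigvals)[::-1][:d]               # d largest eigenvalues
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])    # y_i^p = sqrt(lambda_p) v_p^i

# Usage on a made-up noisy curve embedded in 3-D:
rng = np.random.default_rng(8)
t = np.sort(rng.uniform(0, 3 * np.pi, 200))
X = np.column_stack([np.cos(t), np.sin(t), t]) + 0.01 * rng.normal(size=(200, 3))
Y = isomap(X, n_neighbors=6, d=2)
print(Y.shape)    # (200, 2)
```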
Isomap [Tenenbaum, Silva, Langford. Science 2000]
Isomap [Tenenbaum, Silva, Langford. Science 2000]
Isomap (figure): residual variance vs. number of dimensions for PCA and Isomap, on the Swiss roll data and on face images.
What you need to know
• Dimensionality reduction – why and when it's important
• Principal component analysis
  – minimizing reconstruction error
  – relationship to the covariance matrix and its eigenvectors
  – using SVD
• Non-linear dimensionality reduction
Graphical models
Sunita Sarawagi
IIT Bombay
http://www.cse.iitb.ac.in/~sunita
Probabilistic modeling
Given: several variables x_1, …, x_n, where n is large.
Task: build a joint distribution function Pr(x_1, …, x_n).
Goal: answer several kinds of projection queries on the distribution.
Basic premise
◮ The explicit joint distribution is dauntingly large
◮ Queries are simple marginals (sum or max) over the joint distribution
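To make "projection queries" concrete, here is a toy explicit joint distribution over three binary variables with a sum-marginal and a max-marginal query (the probabilities are random; the point is only that the explicit table has 2^n entries and becomes hopeless for large n):

```python
import numpy as np

# Toy explicit joint distribution over 3 binary variables x1, x2, x3
# (probabilities are made up and normalized to sum to 1).
rng = np.random.default_rng(9)
P = rng.random((2, 2, 2))
P /= P.sum()

# Sum-marginal query: Pr(x1 = 1), summing over x2 and x3.
print(P[1].sum())

# Max-marginal query: most probable full assignment.
print(np.unravel_index(P.argmax(), P.shape))

# The explicit table has 2^n entries -- hopeless once n is large.
print(2 ** 100)
```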
Examples of Joint Distributions so far
◮ Naive Bayes: P(x_1, …, x_d | y), where d is large; assume conditional independence
◮ Multivariate Gaussian
◮ Recurrent neural networks for sequence labeling and prediction
Example
Variables are attributes of people:
◮ Age: 10 ranges; Income: 7 scales; Experience: 7 scales; Degree: 3 scales; Location: 30 places
An explicit joint distribution over all columns is not tractable: the number of combinations is 10 × 7 × 7 × 3 × 30 = 44100.
Queries: estimate the fraction of people with
◮ Income > 200K and Degree = "Bachelors",
◮ Income < 200K, Degree = "PhD", and Experience > 10 years,
◮ and many, many more.
Alternatives to an explicit joint distribution
◮ Assume all columns are independent of each other: a bad assumption
◮ Use data to detect highly correlated column pairs and estimate their pairwise frequencies
  ◮ Many highly correlated pairs: income ⊥̸⊥ age, income ⊥̸⊥ experience, age ⊥̸⊥ experience
  ◮ Ad hoc methods of combining these into a single estimate
◮ Go beyond pairwise correlations: conditional independencies
  ◮ income ⊥̸⊥ age, but income ⊥⊥ age | experience
  ◮ experience ⊥⊥ degree, but experience ⊥̸⊥ degree | income
◮ Graphical models make explicit an efficient joint distribution built from these independencies
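A sketch of the payoff, assuming (for illustration only) the chain structure age → experience → income, which encodes income ⊥⊥ age | experience as in the example above: the joint is stored as small conditional tables instead of one big table. The cardinalities match the earlier slide; all probabilities are made up.

```python
import numpy as np

# Made-up cardinalities from the example slide.
n_age, n_exp, n_inc = 10, 7, 7

rng = np.random.default_rng(10)
def random_cpt(*shape):
    """Random conditional probability table, normalized over its last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# Chain age -> experience -> income encodes income _||_ age | experience.
P_age = random_cpt(n_age)
P_exp_given_age = random_cpt(n_age, n_exp)
P_inc_given_exp = random_cpt(n_exp, n_inc)

# The joint is recovered by multiplying the factors (never stored explicitly
# in practice; built here only to check it is a valid distribution):
joint = (P_age[:, None, None]
         * P_exp_given_age[:, :, None]
         * P_inc_given_exp[None, :, :])
print(np.isclose(joint.sum(), 1.0))

# Numbers stored: 10 + 10*7 + 7*7 = 129, versus 10*7*7 = 490 for the full table.
print(n_age + n_age * n_exp + n_exp * n_inc, n_age * n_exp * n_inc)
```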