Dimensionality Reduction
Lecture 23
David Sontag, New York University
Slides adapted from Carlos Guestrin and Luke Zettlemoyer
Dimensionality reduction
• Input data may have thousands or millions of dimensions!
  – e.g., text data has ???, images have ???
• Dimensionality reduction: represent data with fewer dimensions
  – easier learning – fewer parameters
  – visualization – show high-dimensional data in 2D
  – discover the "intrinsic dimensionality" of the data
    • high-dimensional data that is truly lower dimensional
  – noise reduction
Dimension reduction
• Assumption: data (approximately) lies on a lower dimensional space
• Examples: n = 2, k = 1; n = 3, k = 2
Slide from Yi Zhang
Example (from Bishop)
• Suppose we have a dataset of digits ("3") perturbed in various ways:
• What operations did I perform? What is the data's intrinsic dimensionality?
• Here the underlying manifold is nonlinear
Lower dimensional projections
• Obtain a new feature vector by transforming the original features x_1 … x_n:
    z_1 = w_0^(1) + Σ_i w_i^(1) x_i
    …
    z_k = w_0^(k) + Σ_i w_i^(k) x_i
  (In general the map will not be invertible – cannot go from z back to x)
• New features are linear combinations of the old ones
• Reduces dimension when k < n
• This is typically done in an unsupervised setting – just X, but no Y
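To make the linear map concrete, here is a minimal NumPy sketch (not from the slides; the data, the weight matrix W, and the intercepts w0 are all made up for illustration). Each new feature z_j is a fixed linear combination of the original features, and because k < n the map cannot be inverted.

```python
import numpy as np

# Hypothetical data: m = 4 points with n = 3 original features.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 1.0],
              [0.5, 1.5, 2.5],
              [3.0, 3.0, 0.0]])

# Made-up weights: one row w^(j) per new feature (k = 2), plus intercepts w_0^(j).
W = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.3, 0.7]])
w0 = np.array([0.1, -0.2])

Z = X @ W.T + w0     # z_j = w_0^(j) + sum_i w_i^(j) x_i, for every data point
print(Z.shape)       # (4, 2): dimension reduced from n = 3 to k = 2
# The map is many-to-one for k < n, so x cannot be recovered from z alone.
```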
Which projection is better?
From notes by Andrew Ng
Reminder: Vector projections
• Basic definitions:
  – A · B = |A||B| cos θ
• Assume |B| = 1 (unit vector)
  – A · B = |A| cos θ
  – So, the dot product is the length of the projection!
Using a new basis for the data
• Project a point into a (lower dimensional) space:
  – point: x = (x_1, …, x_n)
  – select a basis – a set of unit (length 1) basis vectors (u_1, …, u_k)
    • we consider an orthonormal basis: u_j · u_j = 1, and u_j · u_l = 0 for j ≠ l
  – select a center – x̄, which defines the offset of the space
  – the best coordinates in the lower dimensional space are given by dot products: (z_1, …, z_k), with z_j^i = (x^i − x̄) · u_j
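A small NumPy sketch of this projection step, using made-up data and an arbitrary orthonormal basis obtained from a QR factorization (PCA will give a principled choice of basis on the following slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy data: m = 100 points, n = 5 features

x_bar = X.mean(axis=0)                 # center of the new coordinate system

# Build k = 2 orthonormal basis vectors u_1, u_2 (here just from a QR
# factorization of a random matrix; any orthonormal basis works for the demo).
U, _ = np.linalg.qr(rng.normal(size=(5, 2)))

# Check orthonormality: U^T U should be the 2x2 identity.
assert np.allclose(U.T @ U, np.eye(2))

# Coordinates in the lower dimensional space: z_j = (x_i - x_bar) . u_j
Z = (X - x_bar) @ U
print(Z.shape)                         # (100, 2)
```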
Maximize variance of projection
Let x^(i) be the i-th data point minus the mean. Choose a unit-length u to maximize

  (1/m) Σ_{i=1}^m (x^(i) · u)² = (1/m) Σ_{i=1}^m u^T x^(i) x^(i)^T u = u^T [ (1/m) Σ_{i=1}^m x^(i) x^(i)^T ] u = u^T Σ u,

where the bracketed term is the covariance matrix Σ. With the constraint ||u|| = 1, the method of Lagrange multipliers shows that the solution is the principal eigenvector of the covariance matrix! (shown on board)
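As a quick numerical sanity check of this claim (toy data, not part of the lecture): the projected variance of the principal eigenvector equals the top eigenvalue of the covariance matrix, and random unit vectors never beat it.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic toy data
Xc = X - X.mean(axis=0)
m = Xc.shape[0]

Sigma = Xc.T @ Xc / m                      # covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
u_star = eigvecs[:, -1]                    # principal eigenvector

def proj_var(u):
    """Projected variance (1/m) sum_i (x^(i) . u)^2 for a unit vector u."""
    u = u / np.linalg.norm(u)
    return np.mean((Xc @ u) ** 2)

# The principal eigenvector attains the top eigenvalue, and no random
# direction does better (Rayleigh quotient bound).
print(proj_var(u_star), eigvals[-1])
assert all(proj_var(rng.normal(size=3)) <= proj_var(u_star) + 1e-9
           for _ in range(200))
```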
Basic PCA algorithm [Pearson 1901; Hotelling 1933]
• Start from the m × n data matrix X
• Recenter: subtract the mean from each row of X
  – X_c ← X − X̄
• Compute the covariance matrix:
  – Σ ← (1/m) X_c^T X_c
• Find the eigenvectors and eigenvalues of Σ
• Principal components: the k eigenvectors with the highest eigenvalues
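A minimal NumPy sketch of exactly these steps, assuming the data is small enough that forming the n × n covariance matrix is acceptable (the `pca` helper name and the test data are mine, not the lecture's):

```python
import numpy as np

def pca(X, k):
    """Basic PCA as on the slide: recenter, covariance, top-k eigenvectors."""
    m, n = X.shape
    X_bar = X.mean(axis=0)
    Xc = X - X_bar                             # recenter: subtract mean from each row
    Sigma = Xc.T @ Xc / m                      # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # symmetric eig, ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    U = eigvecs[:, order]                      # principal components (one per column)
    Z = Xc @ U                                 # coordinates in the new basis
    return Z, U, eigvals[order], X_bar

# Usage on made-up data:
X = np.random.default_rng(2).normal(size=(200, 5))
Z, U, top_vals, X_bar = pca(X, k=2)
print(Z.shape, U.shape)                        # (200, 2) (5, 2)
```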
PCA example (figure): original data, its projection onto the principal component, and the reconstruction in the original space.
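The figure's three panels can be reproduced in miniature: project each centered point onto the first principal component (one number per point), then map it back into the original space. This is a sketch on made-up 2-D data, not the slide's dataset.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy 2-D data lying near a line.
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t]) + 0.05 * rng.normal(size=(200, 2))

x_bar = X.mean(axis=0)
Xc = X - x_bar
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(X))
u1 = eigvecs[:, -1]                    # first principal component

z = Xc @ u1                            # projection: one coordinate per point
X_hat = x_bar + np.outer(z, u1)        # reconstruction back in the original space

print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))  # small reconstruction error
```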
Dimensionality reduction with PCA
• In high-dimensional problems, data usually lies near a linear subspace, since noise introduces only small variability
• Only keep the data projections onto principal components with large eigenvalues; the components of lesser significance can be ignored
• Variance captured by coordinate z_j:
    var(z_j) = (1/m) Σ_{i=1}^m (z_j^i)² = (1/m) Σ_{i=1}^m (x^i · u_j)² = λ_j
• Percentage of total variance captured by dimension z_j: λ_j / Σ_{l=1}^n λ_l
  (figure: variance (%) captured by PC1 … PC10)
• You might lose some information, but if the eigenvalues are small, you don't lose much
Slide from Aarti Singh
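A sketch of how the PC1…PC10 bar chart is computed, i.e., the fraction λ_j / Σ_l λ_l for each component (the data and the `explained_variance` helper are made up for illustration):

```python
import numpy as np

def explained_variance(X):
    """Fraction of total variance captured by each principal direction."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(X))[::-1]  # descending lambda_j
    return eigvals / eigvals.sum()

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 10)) * np.arange(10, 0, -1)  # made-up decaying scales
ratios = explained_variance(X)
print(ratios.round(3))             # analogue of the PC1..PC10 bar chart
print(np.cumsum(ratios).round(3))  # how many components keep most of the variance
```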
Eigenfaces [Turk, Pentland ’91]
• Input images:
• Principal components:
Eigenfaces reconstruction
• Each image corresponds to adding together (weighted versions of) the principal components:
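A sketch of that reconstruction, mean face plus a weighted sum of eigenfaces, using random vectors as stand-ins for face images (no real image data here):

```python
import numpy as np

rng = np.random.default_rng(5)
# Stand-in for face images: m = 50 "images" of 8x8 = 64 pixels, flattened to rows.
X = rng.normal(size=(50, 64))

mean_face = X.mean(axis=0)
Xc = X - mean_face
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(X))
U = eigvecs[:, ::-1][:, :10]           # top-10 "eigenfaces", one per column

z = Xc[0] @ U                          # weights of image 0 on each eigenface
reconstruction = mean_face + U @ z     # mean face + weighted sum of eigenfaces
print(np.linalg.norm(X[0] - reconstruction))  # error shrinks as more components are kept
```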
Scaling up
• The covariance matrix can be really big!
  – Σ is n × n
  – 10,000 features can be common!
  – finding eigenvectors is very slow…
• Use the singular value decomposition (SVD)
  – finds the k eigenvectors for us
  – great implementations available, e.g., Matlab svd
SVD
• Write X = Z S U^T
  – X ← data matrix, one row per datapoint
  – S ← singular value matrix: diagonal, with entries σ_i
    • relationship between the singular values of X and the eigenvalues of Σ: λ_i = σ_i²/m
  – Z ← weight matrix, one row per datapoint
    • Z times S gives the coordinates of x_i in eigenspace
  – U^T ← singular vector matrix
    • in our setting, each row is an eigenvector u_j
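The X_c = Z S U^T decomposition and the λ_i = σ_i²/m relationship can be checked numerically with NumPy's SVD (toy data; `np.linalg.svd` returns the three factors as Z, the vector of σ_i, and U^T):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))
Xc = X - X.mean(axis=0)
m = Xc.shape[0]

# numpy returns X_c = Z * diag(s) * Vt; the rows of Vt are the eigenvectors u_j.
Z, s, Vt = np.linalg.svd(Xc, full_matrices=False)

eigvals = np.linalg.eigvalsh(Xc.T @ Xc / m)[::-1]   # eigenvalues of the covariance
print(np.allclose(eigvals, s ** 2 / m))             # lambda_i = sigma_i^2 / m

coords = Z * s                                      # Z times S: coordinates in eigenspace
print(np.allclose(coords, Xc @ Vt.T))               # same as projecting onto the u_j
```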
PCA using SVD: algorithm
• Start from the m × n data matrix X
• Recenter: subtract the mean from each row of X
  – X_c ← X − X̄
• Call an SVD routine on X_c
  – ask for the k leading singular vectors
• Principal components: the k singular vectors with the highest singular values (rows of U^T)
  – Coefficients: project each point onto the new vectors
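A minimal sketch of this SVD-based variant (function name and test data are mine): it avoids ever forming the n × n covariance matrix.

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD, as on the slide: no n x n covariance matrix is formed."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    Z, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # singular values come sorted
    U = Vt[:k]                      # top-k principal components (rows of U^T)
    coeffs = Xc @ U.T               # project each point onto the new vectors
    return coeffs, U, x_bar

# Usage on made-up data:
X = np.random.default_rng(7).normal(size=(100, 20))
coeffs, U, x_bar = pca_svd(X, k=3)
print(coeffs.shape, U.shape)        # (100, 3) (3, 20)
```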
Non-linear methods
• Linear: Principal Component Analysis (PCA), Factor Analysis, Independent Component Analysis (ICA)
• Nonlinear: Laplacian Eigenmaps, ISOMAP, Local Linear Embedding (LLE)
Slide from Aarti Singh
Isomap
• Goal: use the geodesic distance between points (with respect to the manifold)
• Estimate the manifold using a graph; the distance between points is given by the shortest path distance in the graph
• Embed onto the 2D plane so that Euclidean distance approximates the graph distance
[Tenenbaum, Silva, Langford. Science 2000]
Isomap
Table 1. The Isomap algorithm takes as input the distances d_X(i, j) between all pairs i, j from N data points in the high-dimensional input space X, measured either in the standard Euclidean metric (as in Fig. 1A) or in some domain-specific metric (as in Fig. 1B). The algorithm outputs coordinate vectors y_i in a d-dimensional Euclidean space Y that (according to Eq. 1) best represent the intrinsic geometry of the data. The only free parameter (ε or K) appears in Step 1.
Step 1 – Construct neighborhood graph: Define the graph G over all data points by connecting points i and j if [as measured by d_X(i, j)] they are closer than ε (ε-Isomap), or if i is one of the K nearest neighbors of j (K-Isomap). Set edge lengths equal to d_X(i, j).
Step 2 – Compute shortest paths: Initialize d_G(i, j) = d_X(i, j) if i, j are linked by an edge; d_G(i, j) = ∞ otherwise. Then for each value of k = 1, 2, …, N in turn, replace all entries d_G(i, j) by min{d_G(i, j), d_G(i, k) + d_G(k, j)}. The matrix of final values D_G = {d_G(i, j)} will contain the shortest path distances between all pairs of points in G (16, 19).
Step 3 – Construct d-dimensional embedding: Let λ_p be the p-th eigenvalue (in decreasing order) of the matrix τ(D_G) (17), and v_p^i be the i-th component of the p-th eigenvector. Then set the p-th component of the d-dimensional coordinate vector y_i equal to √λ_p · v_p^i.
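The three steps above translate almost line by line into NumPy. The sketch below is an illustrative K-Isomap implementation, not the authors' reference code: Step 2 uses the Floyd–Warshall update exactly as written in the table, and Step 3 assumes the standard classical-MDS choice τ(D) = −H D² H / 2 with H the centering matrix (the table only cites the definition of τ). The example data are made up and assumed dense enough for the neighborhood graph to be connected.

```python
import numpy as np

def isomap(X, n_neighbors=5, d=2):
    """Sketch of the three Isomap steps (K-Isomap variant)."""
    N = X.shape[0]
    # Step 1: neighborhood graph with Euclidean edge lengths d_X(i, j).
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dG = np.full((N, N), np.inf)
    np.fill_diagonal(dG, 0.0)
    for i in range(N):
        nbrs = np.argsort(dX[i])[1:n_neighbors + 1]   # K nearest neighbors of i
        dG[i, nbrs] = dX[i, nbrs]
        dG[nbrs, i] = dX[i, nbrs]                     # keep the graph symmetric
    # Step 2: shortest paths (Floyd-Warshall, the update written in the table).
    for k in range(N):
        dG = np.minimum(dG, dG[:, k:k + 1] + dG[k:k + 1, :])
    # Step 3: classical MDS on tau(D_G) = -H (D_G^2) H / 2.
    H = np.eye(N) - np.ones((N, N)) / N
    tau = -H @ (dG ** 2) @ H / 2.0
    eigvals, eigvecs = np.linalg.eigh(tau)
    idx = np.argsort(eigvals)[::-1][:d]               # d largest eigenvalues
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])    # y_i^p = sqrt(lambda_p) v_p^i

# Usage on a made-up noisy curve embedded in 3-D:
rng = np.random.default_rng(8)
t = np.sort(rng.uniform(0, 3 * np.pi, 200))
X = np.column_stack([np.cos(t), np.sin(t), t]) + 0.01 * rng.normal(size=(200, 3))
Y = isomap(X, n_neighbors=6, d=2)
print(Y.shape)    # (200, 2)
```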
Isomap [Tenenbaum, Silva, Langford. Science 2000]
Isomap [Tenenbaum, Silva, Langford. Science 2000]
Isomap (figure): residual variance vs. number of dimensions for PCA and Isomap, on the Swiss roll data and on face images.
What you need to know
• Dimensionality reduction – why and when it's important
• Principal component analysis
  – minimizing reconstruction error
  – relationship to the covariance matrix and its eigenvectors
  – using SVD
• Non-linear dimensionality reduction
Graphical models
Sunita Sarawagi
IIT Bombay
http://www.cse.iitb.ac.in/~sunita
Probabilistic modeling
Given: several variables x_1, …, x_n, where n is large.
Task: build a joint distribution function Pr(x_1, …, x_n).
Goal: answer several kinds of projection queries on the distribution.
Basic premise
◮ The explicit joint distribution is dauntingly large
◮ Queries are simple marginals (sum or max) over the joint distribution
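To make "projection queries" concrete, here is a toy explicit joint distribution over three binary variables with a sum-marginal and a max-marginal query (the probabilities are random; the point is only that the explicit table has 2^n entries and becomes hopeless for large n):

```python
import numpy as np

# Toy explicit joint distribution over 3 binary variables x1, x2, x3
# (probabilities are made up and normalized to sum to 1).
rng = np.random.default_rng(9)
P = rng.random((2, 2, 2))
P /= P.sum()

# Sum-marginal query: Pr(x1 = 1), summing over x2 and x3.
print(P[1].sum())

# Max-marginal query: most probable full assignment.
print(np.unravel_index(P.argmax(), P.shape))

# The explicit table has 2^n entries -- hopeless once n is large.
print(2 ** 100)
```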
Examples of Joint Distributions so far
◮ Naive Bayes: P(x_1, …, x_d | y), where d is large; assume conditional independence
◮ Multivariate Gaussian
◮ Recurrent neural networks for sequence labeling and prediction
Example
Variables are attributes of people:
◮ Age: 10 ranges; Income: 7 scales; Experience: 7 scales; Degree: 3 scales; Location: 30 places
An explicit joint distribution over all columns is not tractable: the number of combinations is 10 × 7 × 7 × 3 × 30 = 44100.
Queries: estimate the fraction of people with
◮ Income > 200K and Degree = "Bachelors",
◮ Income < 200K, Degree = "PhD", and Experience > 10 years,
◮ and many, many more.
Alternatives to an explicit joint distribution
◮ Assume all columns are independent of each other: a bad assumption
◮ Use data to detect highly correlated column pairs and estimate their pairwise frequencies
  ◮ Many highly correlated pairs: income ⊥̸⊥ age, income ⊥̸⊥ experience, age ⊥̸⊥ experience
  ◮ Ad hoc methods of combining these into a single estimate
◮ Go beyond pairwise correlations: conditional independencies
  ◮ income ⊥̸⊥ age, but income ⊥⊥ age | experience
  ◮ experience ⊥⊥ degree, but experience ⊥̸⊥ degree | income
◮ Graphical models make explicit an efficient joint distribution built from these independencies
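A sketch of the payoff, assuming (for illustration only) the chain structure age → experience → income, which encodes income ⊥⊥ age | experience as in the example above: the joint is stored as small conditional tables instead of one big table. The cardinalities match the earlier slide; all probabilities are made up.

```python
import numpy as np

# Made-up cardinalities from the example slide.
n_age, n_exp, n_inc = 10, 7, 7

rng = np.random.default_rng(10)
def random_cpt(*shape):
    """Random conditional probability table, normalized over its last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# Chain age -> experience -> income encodes income _||_ age | experience.
P_age = random_cpt(n_age)
P_exp_given_age = random_cpt(n_age, n_exp)
P_inc_given_exp = random_cpt(n_exp, n_inc)

# The joint is recovered by multiplying the factors (never stored explicitly
# in practice; built here only to check it is a valid distribution):
joint = (P_age[:, None, None]
         * P_exp_given_age[:, :, None]
         * P_inc_given_exp[None, :, :])
print(np.isclose(joint.sum(), 1.0))

# Numbers stored: 10 + 10*7 + 7*7 = 129, versus 10*7*7 = 490 for the full table.
print(n_age + n_age * n_exp + n_exp * n_inc, n_age * n_exp * n_inc)
```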