Dimensionality Reduction Lecture 23


  1. Dimensionality Reduction, Lecture 23. David Sontag, New York University. Slides adapted from Carlos Guestrin and Luke Zettlemoyer

  2. Dimensionality reduction • Input data may have thousands or millions of dimensions! – e.g., text data has ???, images have ??? • Dimensionality reduction: represent data with fewer dimensions – easier learning – fewer parameters – visualization – show high dimensional data in 2D – discover the “intrinsic dimensionality” of the data • high dimensional data that is truly lower dimensional • noise reduction

  3. Dimension reduction • Assumption: data (approximately) lies on a lower dimensional space • Examples: n = 2, k = 1; n = 3, k = 2. Slide from Yi Zhang

  4. Example (from Bishop) • Suppose we have a dataset of digits (“3”) perturbed in various ways: • What operations did I perform? What is the data’s intrinsic dimensionality? • Here the underlying manifold is nonlinear

  5. Lower dimensional projections • Obtain a new feature vector by transforming the original features x_1 … x_n: z_1 = w_0^(1) + Σ_i w_i^(1) x_i, …, z_k = w_0^(k) + Σ_i w_i^(k) x_i • In general the transformation will not be invertible – we cannot go from z back to x • New features are linear combinations of the old ones • Reduces dimension when k < n • This is typically done in an unsupervised setting – just X, but no Y
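To make the linear projection concrete, here is a minimal NumPy sketch (an added illustration, not part of the original slides); the weight matrix W and offsets w0 are arbitrary placeholders rather than learned values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 5, 2, 100                 # original dims, reduced dims, number of points
X = rng.normal(size=(m, n))         # one row per data point

W = rng.normal(size=(k, n))         # w^(1), ..., w^(k) as rows (arbitrary here)
w0 = rng.normal(size=k)             # offsets w0^(1), ..., w0^(k)

Z = X @ W.T + w0                    # z_j = w0^(j) + sum_i w_i^(j) x_i, for each point
print(Z.shape)                      # (100, 2): fewer dimensions, since k < n
# W is not square, so the map is not invertible: X cannot be recovered from Z alone.
```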

  6. Which projection is better? From notes by Andrew Ng

  7. Reminder: Vector Projections • Basic definitions: – A·B = |A||B| cos θ • Assume |B| = 1 (unit vector) – A·B = |A| cos θ – So, the dot product is the length of the projection!

  8. Using a new basis for the data • Project a point into a (lower dimensional) space: – point: x = (x_1, …, x_n) – select a basis – a set of unit (length 1) basis vectors (u_1, …, u_k) • we consider an orthonormal basis: – u_j • u_j = 1, and u_j • u_l = 0 for j ≠ l – select a center – x̄, which defines the offset of the space – the best coordinates in the lower dimensional space are defined by dot products: (z_1, …, z_k), with z_j^i = (x^i − x̄) • u_j
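A small added sketch of projecting onto an orthonormal basis around a center x̄ (not from the slides); the basis here is simply k columns of a random orthogonal matrix, standing in for whatever basis one actually selects.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 200, 4, 2
X = rng.normal(size=(m, n))

# Any orthonormal basis works for the illustration: take k columns of a random orthogonal matrix.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
U = Q[:, :k]                              # u_1, ..., u_k as columns: u_j.u_j = 1, u_j.u_l = 0
x_bar = X.mean(axis=0)                    # the center x̄ (offset of the space)

Z = (X - x_bar) @ U                       # z_j^i = (x^i - x̄) . u_j
X_approx = x_bar + Z @ U.T                # map the low-dimensional coordinates back
print(np.allclose(U.T @ U, np.eye(k)))    # True: the basis is orthonormal
```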

  9. Maximize variance of projection • Let x^(i) be the i-th data point minus the mean. Choose a unit-length u to maximize: (1/m) Σ_{i=1..m} (x^(i)^T u)^2 = (1/m) Σ_{i=1..m} u^T x^(i) x^(i)^T u = u^T [ (1/m) Σ_{i=1..m} x^(i) x^(i)^T ] u = u^T Σ u, where Σ = (1/m) Σ_i x^(i) x^(i)^T is the covariance matrix • Let ||u|| = 1 and maximize. Using the method of Lagrange multipliers, one can show that the solution is given by the principal eigenvector of the covariance matrix! (shown on board)
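The claim can be checked numerically. A brief added sketch on synthetic data compares u^T Σ u for the principal eigenvector against random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])   # synthetic data, unequal spread
Xc = X - X.mean(axis=0)                                     # x^(i) = data point minus the mean
Sigma = Xc.T @ Xc / Xc.shape[0]                             # covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)
u_star = eigvecs[:, -1]                                     # principal eigenvector (largest eigenvalue)

def projected_variance(u):
    return u @ Sigma @ u                                    # equals (1/m) sum_i (x^(i)T u)^2

random_u = rng.normal(size=(1000, 3))
random_u /= np.linalg.norm(random_u, axis=1, keepdims=True)
print(projected_variance(u_star) >= max(projected_variance(u) for u in random_u))  # True
```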

  10. Basic PCA algorithm [Pearson 1901; Hotelling 1933] • Start from the m-by-n data matrix X • Recenter: subtract the mean from each row of X – X_c ← X − X̄ • Compute the covariance matrix: – Σ ← (1/m) X_c^T X_c • Find the eigenvectors and eigenvalues of Σ • Principal components: the k eigenvectors with the highest eigenvalues
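A compact NumPy sketch of the algorithm as listed (added for illustration; the helper name pca is mine):

```python
import numpy as np

def pca(X, k):
    """Basic PCA as on the slide: return the top-k principal components and their eigenvalues."""
    Xc = X - X.mean(axis=0)                     # recenter: subtract the mean from each row
    Sigma = Xc.T @ Xc / X.shape[0]              # covariance matrix (1/m) Xc^T Xc
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigendecomposition of the symmetric matrix
    order = np.argsort(eigvals)[::-1][:k]       # k largest eigenvalues
    return eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
components, variances = pca(X, k=2)
Z = (X - X.mean(axis=0)) @ components           # coordinates in the new 2-D space
print(components.shape, Z.shape)                # (10, 2) (300, 2)
```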

  11. PCA example (figures on slide: the data, its projection, and its reconstruction)

  12. Dimensionality reduction with PCA • In high-dimensional problems, data usually lies near a linear subspace, as noise introduces small variability • Only keep data projections onto the principal components with large eigenvalues; the components of lesser significance can be ignored • Variance captured by dimension z_j: var(z_j) = (1/m) Σ_{i=1..m} (z_j^i)^2 = (1/m) Σ_{i=1..m} (x^i · u_j)^2 = λ_j • Percentage of total variance captured by dimension z_j: λ_j / Σ_{l=1..n} λ_l (figure: bar chart of variance (%) for PC1 through PC10) • You might lose some information, but if the eigenvalues are small, you don’t lose much. Slide from Aarti Singh
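The percentage of variance captured comes straight from the eigenvalues; a short added sketch on synthetic data (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 10)) * np.linspace(3.0, 0.1, 10)   # synthetic: decaying spread per axis
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / X.shape[0])[::-1]   # eigenvalues, largest first

explained = eigvals / eigvals.sum()                          # lambda_j / sum_l lambda_l
print(np.round(100 * explained, 1))                          # variance (%) for PC1 ... PC10
print(np.round(100 * np.cumsum(explained), 1))               # how much the first k components keep
```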

  13. Eigenfaces [Turk, Pentland ’91] • Input images: • Principal components: (both shown as face images on the slide)

  14. Eigenfaces reconstruction • Each image corresponds to adding together (weighted versions of) the principal components:

  15. Scaling up • The covariance matrix can be really big! – Σ is n by n – 10,000 features can be common! – finding eigenvectors is very slow… • Use the singular value decomposition (SVD) – finds the k eigenvectors – great implementations available, e.g., Matlab svd

  16. SVD • Write X = Z S U^T – X ← data matrix, one row per datapoint – S ← singular value matrix, a diagonal matrix with entries σ_i • Relationship between the singular values of X and the eigenvalues of Σ: λ_i = σ_i^2 / m – Z ← weight matrix, one row per datapoint • Z times S gives the coordinates of x_i in eigenspace – U^T ← singular vector matrix • In our setting, each row is an eigenvector u_j
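The stated relationship between singular values and eigenvalues is easy to verify numerically; a brief added sketch (NumPy's SVD returns the three factors in order, relabeled here to match the slide's Z, S, U^T):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6))
Xc = X - X.mean(axis=0)                            # centered data, one row per datapoint
m = Xc.shape[0]

Z, sigma, Ut = np.linalg.svd(Xc, full_matrices=False)   # Xc = Z diag(sigma) U^T
lam = np.linalg.eigvalsh(Xc.T @ Xc / m)[::-1]            # eigenvalues of the covariance, descending

print(np.allclose(lam, sigma ** 2 / m))            # True: lambda_i = sigma_i^2 / m
print(np.allclose(Ut @ Ut.T, np.eye(6)))           # rows of U^T are orthonormal eigenvectors u_j
coords = Z * sigma                                 # Z times S: coordinates of each x_i in eigenspace
```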

  17. PCA using SVD algorithm • Start from the m-by-n data matrix X • Recenter: subtract the mean from each row of X – X_c ← X − X̄ • Call an SVD algorithm on X_c – ask for k singular vectors • Principal components: the k singular vectors with the highest singular values (rows of U^T) – Coefficients: project each point onto the new vectors
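A minimal sketch of PCA via SVD following the steps above (an added illustration; pca_svd is my name, not the lecture's code):

```python
import numpy as np

def pca_svd(X, k):
    """PCA using SVD: recenter, run SVD, keep the k top singular vectors."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                                   # recenter: subtract mean from each row
    Z, sigma, Ut = np.linalg.svd(Xc, full_matrices=False)
    components = Ut[:k]                              # principal components: rows of U^T
    coeffs = Xc @ components.T                       # project each point onto the new vectors
    return components, coeffs, x_bar

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 8))
components, coeffs, x_bar = pca_svd(X, k=3)
X_hat = x_bar + coeffs @ components                  # low-rank reconstruction of the data
print(components.shape, coeffs.shape)                # (3, 8) (300, 3)
```

The reconstruction X_hat approaches X as k grows, which is the sense in which PCA minimizes reconstruction error.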

  18. Non-linear methods • Linear methods: Principal Component Analysis (PCA), Factor Analysis, Independent Component Analysis (ICA) • Non-linear methods: Laplacian Eigenmaps, ISOMAP, Local Linear Embedding (LLE). Slide from Aarti Singh

  19. Isomap • Goal: use the geodesic distance between points (with respect to the manifold) • Estimate the manifold using a graph; the distance between points is given by the shortest path distance • Embed onto a 2D plane so that Euclidean distance approximates the graph distance [Tenenbaum, Silva, Langford. Science 2000]

  20. Isomap • Table 1: The Isomap algorithm takes as input the distances d_X(i, j) between all pairs i, j from N data points in the high-dimensional input space X, measured either in the standard Euclidean metric or in some domain-specific metric. The algorithm outputs coordinate vectors y_i in a d-dimensional Euclidean space Y that best represent the intrinsic geometry of the data. The only free parameter (ε or K) appears in Step 1. • Step 1 – Construct neighborhood graph: define the graph G over all data points by connecting points i and j if they are closer than ε (ε-Isomap), or if i is one of the K nearest neighbors of j (K-Isomap), as measured by d_X(i, j). Set edge lengths equal to d_X(i, j). • Step 2 – Compute shortest paths: initialize d_G(i, j) = d_X(i, j) if i and j are linked by an edge, and d_G(i, j) = ∞ otherwise. Then for each value of k = 1, 2, . . ., N in turn, replace all entries d_G(i, j) by min{d_G(i, j), d_G(i, k) + d_G(k, j)}. The matrix of final values D_G = {d_G(i, j)} will contain the shortest path distances between all pairs of points in G. • Step 3 – Construct d-dimensional embedding: let λ_p be the p-th eigenvalue (in decreasing order) of the matrix τ(D_G), and v_p^i the i-th component of the p-th eigenvector. Then set the p-th component of the d-dimensional coordinate vector y_i equal to √λ_p · v_p^i.
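A compact NumPy sketch of these three steps (added here for illustration; the function and parameter names are mine, a brute-force Floyd–Warshall pass is used for Step 2, and the K-nearest-neighbor graph is assumed to come out connected):

```python
import numpy as np

def isomap(X, n_neighbors=10, n_components=2):
    """Sketch of the three Isomap steps above; not the authors' implementation."""
    N = X.shape[0]
    # Step 1: K-nearest-neighbor graph with Euclidean edge lengths
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0.0)
    knn = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    for i in range(N):
        G[i, knn[i]] = D[i, knn[i]]
        G[knn[i], i] = D[i, knn[i]]               # keep the graph symmetric
    # Step 2: all-pairs shortest paths (Floyd-Warshall, O(N^3))
    for k in range(N):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    # Step 3: classical MDS on the geodesic distances
    H = np.eye(N) - np.ones((N, N)) / N           # centering matrix
    tau = -0.5 * H @ (G ** 2) @ H                 # the matrix tau(D_G) from the table
    eigvals, eigvecs = np.linalg.eigh(tau)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

Y = isomap(np.random.default_rng(7).normal(size=(120, 3)), n_neighbors=8, n_components=2)
print(Y.shape)   # (120, 2)
```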

  21. Isomap [Tenenbaum, Silva, Langford. Science 2000]

  22. Isomap [Tenenbaum, Silva, Langford. Science 2000]

  23. Isomap (figures: residual variance vs. number of dimensions for PCA and Isomap, on the Swiss roll data and on face images)

  24. What you need to know • Dimensionality reduction – why and when it’s important • Principal component analysis – minimizing reconstruction error – relationship to the covariance matrix and eigenvectors – using SVD • Non-linear dimensionality reduction

  25. Graphical models. Sunita Sarawagi, IIT Bombay. http://www.cse.iitb.ac.in/~sunita

  26. Probabilistic modeling. Given: several variables x_1, . . ., x_n, where n is large. Task: build a joint distribution function Pr(x_1, . . ., x_n). Goal: answer several kinds of projection queries on the distribution. Basic premise ◮ The explicit joint distribution is dauntingly large ◮ Queries are simple marginals (sum or max) over the joint distribution.

  27. Examples of joint distributions so far ◮ Naive Bayes: P(x_1, . . ., x_d | y), where d is large; assume conditional independence ◮ Multivariate Gaussian ◮ Recurrent Neural Networks for sequence labeling and prediction

  28. Example. Variables are attributes of people: Age (10 ranges), Income (7 scales), Experience (7 scales), Degree (3 scales), Location (30 places). An explicit joint distribution over all columns is not tractable; the number of combinations is 10 × 7 × 7 × 3 × 30 = 44100. Queries: estimate the fraction of people with ◮ Income > 200K and Degree = ”Bachelors” ◮ Income < 200K, Degree = ”PhD” and Experience > 10 years ◮ Many, many more.
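As a concrete, made-up illustration of the explicit joint and of a query as a marginal over it, a short added sketch; the joint table here is random, and the index choices for ”Bachelors” and ”> 200K” are assumptions.

```python
import numpy as np

# Attribute cardinalities from the slide.
card = {"Age": 10, "Income": 7, "Experience": 7, "Degree": 3, "Location": 30}
print(np.prod(list(card.values())))               # 44100 entries in the explicit joint table

# With an explicit joint as a 5-D array of probabilities, a query such as
# "fraction of people with Degree = 'Bachelors' and Income > 200K" is just a marginal:
joint = np.random.dirichlet(np.ones(44100)).reshape(10, 7, 7, 3, 30)  # fake joint, sums to 1
income_degree = joint.sum(axis=(0, 2, 4))         # marginal over Age, Experience, Location
bachelors = 0                                     # assumed index of "Bachelors" on the Degree axis
high_income = income_degree[5:, bachelors].sum()  # assumed: income scales 5 and 6 mean "> 200K"
print(high_income)
```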

  29. Alternatives to an explicit joint distribution. Assume all columns are independent of each other: a bad assumption. Use data to detect highly correlated column pairs and estimate their pairwise frequencies ◮ Many highly correlated pairs: income and age, income and experience, and age and experience are all dependent ◮ Ad hoc methods of combining these into a single estimate. Go beyond pairwise correlations: conditional independencies ◮ income is not independent of age, but income ⊥⊥ age | experience ◮ experience ⊥⊥ degree, but experience is not independent of degree given income. Graphical models make explicit an efficient joint distribution from these independencies.
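To see what such conditional independencies buy, here is a small added sketch (using the attribute cardinalities from the previous slide and assuming income ⊥⊥ age | experience) that compares free-parameter counts for an explicit joint versus a factored model over age, experience, and income:

```python
# With income independent of age given experience, the joint over (age, experience, income)
# factorizes as P(experience) * P(age | experience) * P(income | experience).
AGE, EXP, INC = 10, 7, 7        # cardinalities from the previous slide

full_joint = AGE * EXP * INC - 1                           # 489 free parameters
factored = (EXP - 1) + EXP * (AGE - 1) + EXP * (INC - 1)   # 6 + 63 + 42 = 111
print(full_joint, factored)

# Queries still only need sums over the factors, e.g.
# P(income) = sum_e P(e) * P(income | e)  -- no explicit joint table is required.
```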
