CS 6316 Machine Learning Dimensionality Reduction Yangfeng Ji - PowerPoint PPT Presentation

  1. CS 6316 Machine Learning Dimensionality Reduction Yangfeng Ji Department of Computer Science University of Virginia

  2. Overview: 1. Reducing Dimensions 2. Principal Component Analysis 3. A Different Viewpoint of PCA

  3. Reducing Dimensions

  4. Curse of Dimensionality: What is the volume difference between two $d$-dimensional balls with radii $r_1 = 1$ and $r_2 = 0.99$?

  5. Curse of Dimensionality: What is the volume difference between two $d$-dimensional balls with radii $r_1 = 1$ and $r_2 = 0.99$? ◮ $d = 2$: $\pi (r_1^2 - r_2^2) \approx 0.06$ ◮ $d = 3$: $\frac{4}{3}\pi (r_1^3 - r_2^3) \approx 0.12$

  6. Curse of Dimensionality: What is the volume difference between two $d$-dimensional balls with radii $r_1 = 1$ and $r_2 = 0.99$? ◮ $d = 2$: $\pi (r_1^2 - r_2^2) \approx 0.06$ ◮ $d = 3$: $\frac{4}{3}\pi (r_1^3 - r_2^3) \approx 0.12$ ◮ General form: $\frac{\pi^{d/2}}{\Gamma(d/2 + 1)} (r_1^d - r_2^d)$, with $r_2^d \to 0$ as $d \to \infty$ ◮ E.g., $r_2^{500} \approx 0.00657$

  7. Curse of Dimensionality: What is the volume difference between two $d$-dimensional balls with radii $r_1 = 1$ and $r_2 = 0.99$? ◮ $d = 2$: $\pi (r_1^2 - r_2^2) \approx 0.06$ ◮ $d = 3$: $\frac{4}{3}\pi (r_1^3 - r_2^3) \approx 0.12$ ◮ General form: $\frac{\pi^{d/2}}{\Gamma(d/2 + 1)} (r_1^d - r_2^d)$, with $r_2^d \to 0$ as $d \to \infty$ ◮ E.g., $r_2^{500} \approx 0.00657$ Question: what will happen if we uniformly sample from a $d$-dimensional ball?
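
The numbers on this slide can be checked directly. The following minimal sketch (not from the slides; it assumes only the Python standard library) evaluates the $d$-ball volume formula via logarithms, since $\Gamma(d/2 + 1)$ overflows ordinary floats for large $d$, and also prints the fraction $1 - (r_2/r_1)^d$ of the volume that sits in the thin outer shell.

```python
# Minimal sketch: volume of a d-ball and the outer-shell fraction from the slide.
# Uses log-gamma so the computation stays stable for large d.
from math import exp, lgamma, log, pi

def ball_volume(r, d):
    """Volume of a d-dimensional ball of radius r: pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return exp((d / 2) * log(pi) - lgamma(d / 2 + 1) + d * log(r))

r1, r2 = 1.0, 0.99
for d in (2, 3, 10, 100, 500):
    diff = ball_volume(r1, d) - ball_volume(r2, d)   # absolute volume difference
    shell = 1 - (r2 / r1) ** d                       # fraction of volume in the outer shell
    print(f"d = {d:3d}: volume difference = {diff:.3e}, shell fraction = {shell:.5f}")

# For d = 500, (0.99)^500 ~ 0.00657, so about 99.3% of a uniform sample from the
# unit ball lands in the outer shell of thickness 0.01 (while the absolute volume
# of the ball itself shrinks toward zero as d grows).
```

This is one way to see the answer to the question on the slide: in high dimensions, points sampled uniformly from a ball concentrate near its boundary.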

  8. Dimensionality Reduction: Dimensionality reduction is the process of taking data in a high-dimensional space and mapping it into a new space whose dimensionality is much smaller.

  9. Dimensionality Reduction: Dimensionality reduction is the process of taking data in a high-dimensional space and mapping it into a new space whose dimensionality is much smaller. Mathematically, it means $f: x \to \tilde{x}$ (1), where $x \in \mathbb{R}^{d}$ and $\tilde{x} \in \mathbb{R}^{n}$ with $n < d$.

  10. Reducing Dimensions: A toy example. For the purpose of reducing dimensions, we can project $x = (x_1, x_2)$ onto the direction along $x_1$ or $x_2$. [Figure: two data points in the $(x_1, x_2)$ plane] Question: Given these two data examples, which direction should we pick, $x_1$ or $x_2$?

  12. Reducing Dimensions: A toy example (II). There is a better solution if we are allowed to rotate the coordinate axes. [Figure: the two points with axes $x_1$ and $x_2$]

  13. Reducing Dimensions: A toy example (II). There is a better solution if we are allowed to rotate the coordinate axes. [Figure: the two points with axes $x_1$, $x_2$ and rotated directions $u_1$, $u_2$] Pick $u_1$; then we preserve all the variance of the examples.

  14. Reducing Dimensions: A toy example (III). Consider a general case, where the examples do not lie on a perfect line. [Bishop, 2006, Section 12.1]

  15. Reducing Dimensions: A toy example (III). Consider a general case, where the examples do not lie on a perfect line. We can follow the same idea by finding a direction that preserves most of the variance of the examples. [Bishop, 2006, Section 12.1]
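
To make the toy example concrete, here is a small sketch (not from the deck; the data-generating line and the use of NumPy's eigh are illustrative assumptions) comparing the variance kept by projecting 2-D points onto the $x_1$ axis, the $x_2$ axis, and the leading direction $u_1$.

```python
# Sketch: for points lying roughly on a line, the direction u1 of largest
# variance keeps more variance than either coordinate axis.
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 0.8 * t + 0.1 * rng.normal(size=200)])  # points near a line
X = X - X.mean(axis=0)                                          # center the data

Sigma = X.T @ X / len(X)                   # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
u1 = eigvecs[:, -1]                        # direction of largest variance

print("variance along the x1 axis:", X[:, 0].var())
print("variance along the x2 axis:", X[:, 1].var())
print("variance along u1         :", ((X @ u1) ** 2).mean())  # the largest eigenvalue
print("total variance            :", eigvals.sum())
```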

  16. Principal Component Analysis

  17. Formulation: Given a set of examples $S = \{x_1, \ldots, x_m\}$ ◮ Center the data by removing the mean: $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$, $x_i \leftarrow x_i - \bar{x}$ for all $i \in [m]$ (2)

  18. Formulation: Given a set of examples $S = \{x_1, \ldots, x_m\}$ ◮ Center the data by removing the mean: $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$, $x_i \leftarrow x_i - \bar{x}$ for all $i \in [m]$ (2) ◮ Assume $u$ is the direction onto which we would like to project the data; then the objective function is the projected data variance $J(u) = \frac{1}{m}\sum_{i=1}^{m} (u^{\top} x_i)^2$ (3)

  19. Formulation: Given a set of examples $S = \{x_1, \ldots, x_m\}$ ◮ Center the data by removing the mean: $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$, $x_i \leftarrow x_i - \bar{x}$ for all $i \in [m]$ (2) ◮ Assume $u$ is the direction onto which we would like to project the data; then the objective function is the projected data variance $J(u) = \frac{1}{m}\sum_{i=1}^{m} (u^{\top} x_i)^2$ (3) ◮ Maximizing $J(u)$ is trivial if there is no constraint on $u$; therefore, we require $\|u\|_2^2 = u^{\top} u = 1$
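
A minimal sketch of these two steps, assuming the data are stored as an $(m, d)$ NumPy array X with one example per row (the array layout and helper names are my own, not the instructor's):

```python
# Sketch of Eq. (2) (centering) and Eq. (3) (the projected-variance objective).
import numpy as np

def center(X):
    """Remove the mean: x_i <- x_i - x_bar for every row of X."""
    return X - X.mean(axis=0)

def J(u, X):
    """Variance of the centered data projected onto the unit vector u."""
    u = u / np.linalg.norm(u)        # enforce the constraint u^T u = 1
    return np.mean((X @ u) ** 2)     # (1/m) sum_i (u^T x_i)^2
```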

  20. Covariance Matrix: The definition of $J(u)$ can be written as $J(u) = \frac{1}{m}\sum_{i=1}^{m} (u^{\top} x_i)^2$ (4) $= \frac{1}{m}\sum_{i=1}^{m} u^{\top} x_i \, u^{\top} x_i$ (5) $= \frac{1}{m}\sum_{i=1}^{m} u^{\top} x_i x_i^{\top} u$ (6) $= u^{\top} \left( \frac{1}{m}\sum_{i=1}^{m} x_i x_i^{\top} \right) u$ (7) $= u^{\top} \Sigma u$ (8), where $\Sigma$ is the data covariance matrix
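
Equations (4)-(8) can also be verified numerically. The sketch below is an illustration (it reuses the center and J helpers defined in the previous sketch) that checks $J(u) = u^{\top} \Sigma u$ for a random unit vector $u$:

```python
# Check that the projected variance equals the quadratic form u^T Sigma u.
import numpy as np

rng = np.random.default_rng(1)
X = center(rng.normal(size=(500, 5)))      # m = 500 examples in d = 5 dimensions
Sigma = X.T @ X / len(X)                   # (1/m) sum_i x_i x_i^T

u = rng.normal(size=5)
u = u / np.linalg.norm(u)
print(np.isclose(J(u, X), u @ Sigma @ u))  # True: Eqs. (4) and (8) agree
```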

  21. Optimization ◮ The optimization problem for finding a single projection direction is $\max_u J(u) = u^{\top} \Sigma u$ (9) s.t. $u^{\top} u = 1$ (10)

  22. Optimization ◮ The optimization problem for finding a single projection direction is $\max_u J(u) = u^{\top} \Sigma u$ (9) s.t. $u^{\top} u = 1$ (10) ◮ It can be converted to an unconstrained optimization problem with a Lagrange multiplier: $\max_u \left[ u^{\top} \Sigma u + \lambda (1 - u^{\top} u) \right]$ (11)

  23. Optimization ◮ The optimization problem for finding a single projection direction is $\max_u J(u) = u^{\top} \Sigma u$ (9) s.t. $u^{\top} u = 1$ (10) ◮ It can be converted to an unconstrained optimization problem with a Lagrange multiplier: $\max_u \left[ u^{\top} \Sigma u + \lambda (1 - u^{\top} u) \right]$ (11) ◮ The optimal solution is given by $\Sigma u - \lambda u = 0$ (12), i.e., $\Sigma u = \lambda u$ (13)

  24. Two Observations: There are two observations from $\Sigma u = \lambda u$ (14) ◮ First, $\lambda$ is an eigenvalue of $\Sigma$ and $u$ is the corresponding eigenvector (Lecture 01, page 29).

  25. Two Observations: There are two observations from $\Sigma u = \lambda u$ (14) ◮ First, $\lambda$ is an eigenvalue of $\Sigma$ and $u$ is the corresponding eigenvector (Lecture 01, page 29). ◮ Second, multiplying both sides by $u^{\top}$, we have $u^{\top} \Sigma u = \lambda$ (15). In order to maximize $J(u)$, $\lambda$ has to be the largest eigenvalue and $u$ the corresponding eigenvector.
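
Both observations are easy to confirm in code. The sketch below continues the running example (it assumes Sigma is the covariance matrix computed in the previous sketch) and checks Eqs. (14) and (15) for the largest eigenpair returned by NumPy:

```python
# Verify the two observations: Sigma u = lambda u, and u^T Sigma u = lambda
# for the largest eigenvalue.
import numpy as np

eigvals, eigvecs = np.linalg.eigh(Sigma)   # symmetric matrix: eigenvalues ascending
lam, u1 = eigvals[-1], eigvecs[:, -1]      # largest eigenvalue and its eigenvector

print(np.allclose(Sigma @ u1, lam * u1))   # Eq. (14)
print(np.isclose(u1 @ Sigma @ u1, lam))    # Eq. (15): the maximal projected variance
```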

  26. Principal Component Analysis ◮ As $u$ indicates the first major direction that can preserve the data variance, it is called the first principal component

  27. Principal Component Analysis ◮ As $u$ indicates the first major direction that can preserve the data variance, it is called the first principal component ◮ In general, with eigendecomposition we have $U^{\top} \Sigma U = \Lambda$ (16) ◮ Eigenvalues: $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ ◮ Eigenvectors: $U = [u_1, \ldots, u_d]$

  28. Principal Component Analysis (II): Assume that in $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$ (17)

  29. Principal Component Analysis (II): Assume that in $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$ (17) To reduce the dimensionality of $x$ from $d$ to $n$, with $n < d$: ◮ Take the first $n$ eigenvectors in $U$ and form $\tilde{U} = [u_1, \ldots, u_n] \in \mathbb{R}^{d \times n}$ (18)

  30. Principal Component Analysis (II): Assume that in $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$ (17) To reduce the dimensionality of $x$ from $d$ to $n$, with $n < d$: ◮ Take the first $n$ eigenvectors in $U$ and form $\tilde{U} = [u_1, \ldots, u_n] \in \mathbb{R}^{d \times n}$ (18) ◮ Reduce the dimensionality of $x$ as $\tilde{x} = \tilde{U}^{\top} x \in \mathbb{R}^{n}$ (19)

  31. Principal Component Analysis (II): Assume that in $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$ (17) To reduce the dimensionality of $x$ from $d$ to $n$, with $n < d$: ◮ Take the first $n$ eigenvectors in $U$ and form $\tilde{U} = [u_1, \ldots, u_n] \in \mathbb{R}^{d \times n}$ (18) ◮ Reduce the dimensionality of $x$ as $\tilde{x} = \tilde{U}^{\top} x \in \mathbb{R}^{n}$ (19) ◮ The value of $n$ can be determined by requiring $\frac{\sum_{i=1}^{n} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \approx 0.95$ (20)
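
Putting Eqs. (16)-(20) together gives a compact PCA routine. The sketch below is one possible implementation (the function name, the 95% default, and the use of numpy.linalg.eigh are my choices, not the course's reference code):

```python
# PCA by eigendecomposition of the covariance matrix, keeping enough components
# to explain roughly 95% of the variance.
import numpy as np

def pca_reduce(X, variance_kept=0.95):
    Xc = X - X.mean(axis=0)                              # Eq. (2): center the data
    Sigma = Xc.T @ Xc / len(Xc)                          # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort so lambda_1 >= ... >= lambda_d
    ratios = np.cumsum(eigvals) / eigvals.sum()
    n = int(np.searchsorted(ratios, variance_kept)) + 1  # Eq. (20): smallest n reaching the threshold
    U_tilde = eigvecs[:, :n]                             # Eq. (18): first n eigenvectors, shape (d, n)
    return Xc @ U_tilde, U_tilde                         # Eq. (19): rows are U_tilde^T x_i

# Usage: X_reduced, U_tilde = pca_reduce(X) for an (m, d) data matrix X.
```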

  32. Applications: Image Processing. Reduce the dimensionality of an image dataset from $28 \times 28 = 784$ to $M$. [Figure (a): original data] [Bishop, 2006, Section 12.1]

  33. Applications: Image Processing. Reduce the dimensionality of an image dataset from $28 \times 28 = 784$ to $M$. [Figure (a): original data] [Figure (b): with the first $M$ principal components] [Bishop, 2006, Section 12.1]
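
One way to produce images like panel (b) is to reconstruct each image from its first $M$ principal components, $\hat{x} = \bar{x} + \tilde{U}_M \tilde{U}_M^{\top}(x - \bar{x})$. The sketch below only illustrates the mechanics: the random array is a stand-in for a real $784$-dimensional image dataset, and the choice $M = 50$ is arbitrary.

```python
# Reconstruct (stand-in) 784-dimensional images from their first M principal components.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 784))          # placeholder for m images of 28 x 28 = 784 pixels
x_bar = X.mean(axis=0)
Xc = X - x_bar
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))

M = 50
U_M = eigvecs[:, ::-1][:, :M]             # first M principal components
X_hat = x_bar + Xc @ U_M @ U_M.T          # reconstruction from M components
print("mean squared reconstruction error:", np.mean((X - X_hat) ** 2))
```

On real image data most of the variance concentrates in a small number of components, so modest values of $M$ already give recognizable reconstructions; random noise, by contrast, has no such low-dimensional structure.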

  34. A Different Viewpoint of PCA

  35. Data Reconstruction: Another way to formulate the objective function of PCA is $\min_{W, U} \sum_{i=1}^{m} \|x_i - U W x_i\|_2^2$ (21), where ◮ $W \in \mathbb{R}^{n \times d}$ maps $x_i$ from the original space to a lower-dimensional space $\mathbb{R}^{n}$ ◮ $U \in \mathbb{R}^{d \times n}$ maps it back to the original space $\mathbb{R}^{d}$ [Shalev-Shwartz and Ben-David, 2014, Chap 23]

  36. Data Reconstruction: Another way to formulate the objective function of PCA is $\min_{W, U} \sum_{i=1}^{m} \|x_i - U W x_i\|_2^2$ (21), where ◮ $W \in \mathbb{R}^{n \times d}$ maps $x_i$ from the original space to a lower-dimensional space $\mathbb{R}^{n}$ ◮ $U \in \mathbb{R}^{d \times n}$ maps it back to the original space $\mathbb{R}^{d}$ ◮ Dimensionality reduction is performed as $\tilde{x} = W x$, while $U$ maps $\tilde{x}$ back to make sure the reduction does not lose much information [Shalev-Shwartz and Ben-David, 2014, Chap 23]

  37. Optimization: Consider the optimization problem $\min_{W, U} \sum_{i=1}^{m} \|x_i - U W x_i\|_2^2$ (22) ◮ Let $(W, U)$ be a solution of the problem in Eq. (22) [Shalev-Shwartz and Ben-David, 2014, Lemma 23.1]; then ◮ the columns of $U$ are orthonormal ◮ $W = U^{\top}$

  38. Optimization: Consider the optimization problem $\min_{W, U} \sum_{i=1}^{m} \|x_i - U W x_i\|_2^2$ (22) ◮ Let $(W, U)$ be a solution of the problem in Eq. (22) [Shalev-Shwartz and Ben-David, 2014, Lemma 23.1]; then ◮ the columns of $U$ are orthonormal ◮ $W = U^{\top}$ ◮ The optimization problem can therefore be simplified to $\min_{U: U^{\top} U = I} \sum_{i=1}^{m} \|x_i - U U^{\top} x_i\|_2^2$ (23). The solution will be the same.
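
The following sketch (my illustration, not from the lecture) connects the two viewpoints numerically: with orthonormal $U$ and $W = U^{\top}$, the reconstruction error $\sum_i \|x_i - U U^{\top} x_i\|_2^2$ is minimized by the top-$n$ eigenvectors of the covariance matrix, i.e., exactly the PCA directions, and any other orthonormal choice does at least as badly.

```python
# Compare the reconstruction error of the PCA directions with a random
# orthonormal alternative; the PCA directions attain the minimum of Eq. (23).
import numpy as np

def reconstruction_error(U, X):
    """sum_i ||x_i - U U^T x_i||_2^2 for an orthonormal (d, n) matrix U."""
    return np.sum((X - X @ U @ U.T) ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))   # correlated 6-dimensional data
X = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(X.T @ X / len(X))
U_pca = eigvecs[:, ::-1][:, :2]                       # top-2 eigenvectors (PCA solution)
U_rand, _ = np.linalg.qr(rng.normal(size=(6, 2)))     # a random orthonormal (6, 2) matrix

print("error with PCA directions   :", reconstruction_error(U_pca, X))
print("error with random directions:", reconstruction_error(U_rand, X))  # at least as large
```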
