Machine Learning (CSE 446): PCA (continued) and Learning as Minimizing Loss
Sham M Kakade © 2018 University of Washington cse446-staff@cs.washington.edu
PCA: continuing on...
Dimension of Greatest Variance
Assume that the data are centered, i.e., that $\text{mean} = \frac{1}{N}\sum_{n=1}^{N} x_n = 0$.
Projection into One Dimension
Let $u$ be the dimension of greatest variance, where $\|u\|^2 = 1$.
$p_n = x_n \cdot u$ is the projection of the $n$th example onto $u$.
Since the mean of the data is $0$, the mean of $\langle p_1, \ldots, p_N \rangle$ is also $0$.
This implies that the variance of $\langle p_1, \ldots, p_N \rangle$ is $\frac{1}{N}\sum_{n=1}^{N} p_n^2$.
The $u$ that gives the greatest variance, then, is:
$$\operatorname*{argmax}_u \sum_{n=1}^{N} (x_n \cdot u)^2$$
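As a quick illustration (not from the slides), here is a minimal NumPy sketch that computes the projections $p_n = x_n \cdot u$ for a candidate unit vector and checks that their mean is $0$ and their variance is $\frac{1}{N}\sum_n p_n^2$; the data matrix `X` and direction `u` are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))      # toy data, one example per row
X = X - X.mean(axis=0)             # center the data so the mean is 0

u = np.array([1.0, 2.0])
u = u / np.linalg.norm(u)          # candidate direction with ||u|| = 1

p = X @ u                          # projections p_n = x_n . u
print(np.isclose(p.mean(), 0.0))           # mean of projections is 0
print(np.isclose(p.var(), np.mean(p**2)))  # variance = (1/N) sum_n p_n^2
```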
Finding the Maximum-Variance Direction
$$\operatorname*{argmax}_u \sum_{n=1}^{N} (x_n \cdot u)^2 \quad \text{s.t. } \|u\|^2 = 1$$
(Why do we constrain $u$ to have length 1?)
If we let $X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_N^\top \end{bmatrix}$, then we want: $\operatorname*{argmax}_u \|Xu\|^2$, s.t. $\|u\|^2 = 1$.
This is PCA in one dimension!
Linear algebra review: things to understand
◮ $\|x\|$ is the Euclidean norm.
◮ What is the dimension of $Xu$?
◮ What is the $i$th component of $Xu$?
◮ Also, note: $\|u\|^2 = u^\top u$
◮ So what is $\|Xu\|^2$?
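A quick numerical check of these facts, as a sketch with made-up toy data (not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))   # N = 5 examples, d = 3 features
u = rng.normal(size=3)

Xu = X @ u
print(Xu.shape)                                   # Xu is N-dimensional: (5,)
print(np.isclose(Xu[2], X[2] @ u))                # i-th component is x_i . u
print(np.isclose(u @ u, np.linalg.norm(u) ** 2))  # ||u||^2 = u^T u
print(np.isclose(Xu @ Xu, u @ X.T @ X @ u))       # ||Xu||^2 = u^T X^T X u
```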
Constrained Optimization
The blue lines represent contours: all points on a blue line have the same objective function value.
Deriving the Solution
Don't panic.
$$\operatorname*{argmax}_u \|Xu\|^2, \quad \text{s.t. } \|u\|^2 = 1$$
◮ The Lagrangian encoding of the problem moves the constraint into the objective:
$$\max_u \min_\lambda \; \|Xu\|^2 - \lambda(\|u\|^2 - 1) \;\Rightarrow\; \min_\lambda \max_u \; \|Xu\|^2 - \lambda(\|u\|^2 - 1)$$
◮ Gradient (first derivatives with respect to $u$): $2X^\top X u - 2\lambda u$
◮ Setting equal to $0$ leads to: $\lambda u = X^\top X u$
◮ You may recognize this as the definition of an eigenvector ($u$) and eigenvalue ($\lambda$) for the matrix $X^\top X$.
◮ We take the first (largest) eigenvalue.
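The solution can be computed directly: the maximizer is the eigenvector of $X^\top X$ with the largest eigenvalue. A minimal NumPy sketch on toy data (the data and the comparison vector are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])  # toy data
X = X - X.mean(axis=0)                                    # center it

# Eigendecomposition of X^T X (symmetric, so use eigh; eigenvalues ascend).
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
u = eigvecs[:, -1]          # eigenvector for the largest eigenvalue
lam = eigvals[-1]

# Check the stationarity condition: lambda * u = X^T X u.
print(np.allclose(lam * u, X.T @ X @ u))

# Its objective value ||Xu||^2 beats any other unit vector we try.
v = rng.normal(size=3)
v = v / np.linalg.norm(v)
print(np.linalg.norm(X @ u) ** 2 >= np.linalg.norm(X @ v) ** 2)
```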
Deriving the Solution: Scratch space
Variance in Multiple Dimensions
So far, we've projected each $x_n$ into one dimension. To get a second direction $v$, we solve the same problem again, but this time with another constraint:
$$\operatorname*{argmax}_v \|Xv\|^2, \quad \text{s.t. } \|v\|^2 = 1 \text{ and } u \cdot v = 0$$
(That is, we want a dimension that's orthogonal to the $u$ that we found earlier.)
Following the same steps we had for $u$, the solution will be the second eigenvector.
“Eigenfaces”
Fig. from https://github.com/AlexOuyang/RealTimeFaceRecognition
Principal Components Analysis
◮ Input: unlabeled data $X = [x_1 | x_2 | \cdots | x_N]^\top$; dimensionality $K < d$
◮ Output: $K$-dimensional "subspace".
◮ Algorithm:
  1. Compute the mean $\mu$
  2. Compute the covariance matrix: $\Sigma = \frac{1}{N}\sum_i (x_i - \mu)(x_i - \mu)^\top$
  3. Let $\langle \lambda_1, \ldots, \lambda_K \rangle$ be the top $K$ eigenvalues of $\Sigma$ and $\langle u_1, \ldots, u_K \rangle$ be the corresponding eigenvectors
◮ Let $U = [u_1 | u_2 | \cdots | u_K]$. Return $U$.
You can read about many algorithms for finding eigendecompositions of a matrix.
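A minimal NumPy sketch of the algorithm as stated on the slide; the function name `pca` and the return convention (eigenvectors as columns of `U`) are my own choices, not from the slides.

```python
import numpy as np

def pca(X, K):
    """PCA as on the slide: mean, covariance, top-K eigenvectors.

    X: (N, d) data matrix, one example per row.  Returns (mu, U), where
    U is (d, K) with the top-K eigenvectors of the covariance as columns.
    """
    mu = X.mean(axis=0)                        # 1. compute the mean
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / X.shape[0]           # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:K]      # 3. top-K eigenvalues/eigenvectors
    U = eigvecs[:, order]
    return mu, U
```

The $K$-dimensional projection of a new example $x$ is then `U.T @ (x - mu)`.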
Alternate View of PCA: Minimizing Reconstruction Error
Assume that the data are centered. Find a line which minimizes the squared reconstruction error.
Projection and Reconstruction: the One-Dimensional Case
◮ Take out the mean $\mu$.
◮ Find the "top" eigenvector $u$ of the covariance matrix.
◮ What are your projections?
◮ What are your reconstructions, $\widehat{X} = [\widehat{x}_1 | \widehat{x}_2 | \cdots | \widehat{x}_N]^\top$?
◮ What is your reconstruction error? $\frac{1}{N}\sum_i (x_i - \widehat{x}_i)^2 = ??$
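A sketch of the one-dimensional case that fills in these answers numerically; the toy data are made up, and `u` is the top eigenvector of the covariance as on the slide:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.5], [0.0, 0.4]])  # toy data
mu = X.mean(axis=0)
Xc = X - mu                                      # take out the mean

Sigma = (Xc.T @ Xc) / X.shape[0]
u = np.linalg.eigh(Sigma)[1][:, -1]              # "top" eigenvector

p = Xc @ u                                       # projections (one scalar per example)
X_hat = mu + np.outer(p, u)                      # reconstructions x_hat_i
err = np.mean(np.sum((X - X_hat) ** 2, axis=1))  # (1/N) sum_i ||x_i - x_hat_i||^2
print(err)
```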
Alternate View: Minimizing Reconstruction Error with a $K$-Dimensional Subspace
Equivalent ("dual") formulation of PCA: find an "orthonormal basis" $u_1, u_2, \ldots, u_K$ which minimizes the total reconstruction error on the data:
$$\operatorname*{argmin}_{\text{orthonormal basis } u_1, \ldots, u_K} \; \frac{1}{N}\sum_i \big(x_i - \mathrm{Proj}_{u_1, \ldots, u_K}(x_i)\big)^2$$
Recall the projection of $x$ onto a $K$-orthonormal basis is:
$$\mathrm{Proj}_{u_1, \ldots, u_K}(x) = \sum_{j=1}^{K} (u_j \cdot x)\, u_j$$
The SVD "simultaneously" finds all of $u_1, u_2, \ldots, u_K$.
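Since the slide notes that the SVD finds all the directions at once, here is a sketch using `np.linalg.svd` on centered toy data; the top-$K$ right singular vectors give the orthonormal basis, and the reconstruction error is computed as above:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))       # toy data
Xc = X - X.mean(axis=0)             # centered
K = 2

# Right singular vectors of the centered data = eigenvectors of X^T X.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
U = Vt[:K].T                        # (d, K) orthonormal basis u_1, ..., u_K

proj = Xc @ U @ U.T                 # Proj_{u_1..u_K}(x_i) for every example
recon_err = np.mean(np.sum((Xc - proj) ** 2, axis=1))
print(recon_err)
```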
Choosing K (Hyperparameter Tuning)
How do you select $K$ for PCA? Read CIML (similar methods for $K$-means).
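CIML is the reference here; one common heuristic (my own illustration, not necessarily the method CIML recommends) is to pick the smallest $K$ whose top eigenvalues explain a target fraction of the total variance:

```python
import numpy as np

def choose_k(eigvals, target=0.95):
    """Smallest K whose top eigenvalues explain `target` of total variance."""
    vals = np.sort(eigvals)[::-1]                  # largest eigenvalues first
    explained = np.cumsum(vals) / vals.sum()       # cumulative fraction explained
    return int(np.searchsorted(explained, target) + 1)
```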
PCA and Clustering
There's a unified view of both PCA and clustering.
◮ $K$-means chooses cluster means so that squared distances to the data are small.
◮ PCA chooses a basis so that the reconstruction error of the data is small.
Both attempt to find a "simple" way to summarize the data: fewer points or fewer dimensions. Both could be used to create new features for supervised learning.
Loss functions
Perceptron
A model and an algorithm, rolled into one.
Model: $f(x) = \mathrm{sign}(w \cdot x + b)$, known as linear, visualized by a (hopefully) separating hyperplane in feature space.
Algorithm: PerceptronTrain, an error-driven, iterative updating algorithm.
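A minimal sketch of the model and the error-driven update, in the spirit of the standard PerceptronTrain; the `max_iter` parameter and the loop structure are my own choices for a self-contained example.

```python
import numpy as np

def perceptron_train(X, y, max_iter=100):
    """X: (N, d) features, y: (N,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):
        for x_n, y_n in zip(X, y):
            if y_n * (x_n @ w + b) <= 0:   # mistake: update toward the example
                w += y_n * x_n
                b += y_n
    return w, b

def perceptron_predict(x, w, b):
    return np.sign(w @ x + b)              # the linear model f(x) = sign(w.x + b)
```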
A Different View of PerceptronTrain: Optimization
"Minimize training-set error rate":
$$\min_{w,b} \; \underbrace{\frac{1}{N}\sum_{n=1}^{N} \mathbf{1}\big[\, y_n \cdot (w \cdot x_n + b) \le 0 \,\big]}_{\hat{\epsilon}_{\text{train}} \;\equiv\; \text{zero-one loss}}$$
where the margin is $y \cdot (w \cdot x + b)$.
This problem is NP-hard; even trying to get a (multiplicative) approximation is NP-hard.
What the perceptron does instead:
$$\min_{w,b} \; \frac{1}{N}\sum_{n=1}^{N} \underbrace{\max\big(-y_n \cdot (w \cdot x_n + b),\, 0\big)}_{\text{perceptron loss}}$$
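A sketch comparing the two objectives on a dataset; both are averaged over the training set as on the slide (function names are my own):

```python
import numpy as np

def zero_one_loss(w, b, X, y):
    """Training-set error rate: fraction of examples with margin <= 0."""
    margins = y * (X @ w + b)
    return np.mean(margins <= 0)

def perceptron_loss(w, b, X, y):
    """Average perceptron loss: max(-margin, 0) per example."""
    margins = y * (X @ w + b)
    return np.mean(np.maximum(-margins, 0.0))
```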
Smooth out the Loss?