Machine Learning for Signal Processing
Supervised Representations: Class 19
8 Nov 2016
Bhiksha Raj
Slides by Najim Dehak
Definitions: Variance and Covariance
(Figure: scatter plots of x vs. y illustrating positive covariance, σ_xy > 0)
• Variance: S_XX = E[XX^T], estimated as S_XX = (1/N) XX^T
  – How "spread out" the data are in the direction of X
  – Scalar version: σ_x² = E[x²]
• Covariance: S_XY = E[XY^T], estimated as S_XY = (1/N) XY^T
  – How much does X predict Y
  – Scalar version: σ_xy = E[xy]
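As a concrete illustration of the two estimators above, here is a minimal NumPy sketch (not from the slides; the synthetic data and variable names are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, centered data: columns of X (d_x x N) and Y (d_y x N) are paired samples
N = 1000
X = rng.standard_normal((3, N))
Y = 0.5 * X[:2, :] + 0.1 * rng.standard_normal((2, N))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

# Sample estimates matching S_XX = (1/N) X X^T and S_XY = (1/N) X Y^T
S_XX = X @ X.T / N    # d_x x d_x: spread of the data along each direction of X
S_XY = X @ Y.T / N    # d_x x d_y: how much X co-varies with (predicts) Y
print(S_XX.shape, S_XY.shape)
```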
Definition: Whitening Matrix
• Ẑ = Σ_XX^(-1/2) (X − μ)
• If X is already centered: Ẑ = Σ_XX^(-1/2) X
• Whitening matrix: Σ_XX^(-1/2)
  – Transforms the variable to unit variance
  – Scalar version: σ_x^(-1)
Definition: Correlation Coefficient
(Figure: scatter plot of x vs. y illustrating positive correlation, ρ_xy > 0)
• Whitened variables: x̂ = σ_x^(-1) x, ŷ = σ_y^(-1) y
• Matrix version, built from the whitening matrices: ρ_XY = Σ_XX^(-1/2) Σ_XY Σ_YY^(-1/2)
• Scalar version: ρ_xy = σ_xy / (σ_x σ_y)
  – Explains how Y varies with X, after normalizing out the innate variation of X and Y
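A small self-contained sketch of the whitening matrix and the matrix correlation coefficient defined on the last two slides; computing Σ_XX^(-1/2) through a symmetric eigendecomposition is one common choice, assumed here rather than taken from the slides:

```python
import numpy as np

def inv_sqrt(C, eps=1e-10):
    """Inverse matrix square root C^(-1/2) via symmetric eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

rng = np.random.default_rng(0)
N = 1000
X = rng.standard_normal((3, N))
Y = 0.5 * X[:2, :] + 0.1 * rng.standard_normal((2, N))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

S_XX, S_YY, S_XY = X @ X.T / N, Y @ Y.T / N, X @ Y.T / N

W_X = inv_sqrt(S_XX)                  # whitening matrix for X
Z = W_X @ X                           # whitened data: covariance is (close to) identity
print(np.round(Z @ Z.T / N, 2))

rho_XY = W_X @ S_XY @ inv_sqrt(S_YY)  # matrix correlation coefficient
print(np.round(rho_XY, 2))
```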
MLSP
• Application of Machine Learning techniques to the analysis of signals
(Block diagram: sensor and Channel → Signal Capture → Feature Extraction → Modeling/Regression, guided by External Knowledge)
• Feature Extraction:
  – Supervised (Guided) representation
Data specific bases?
• Issue: The bases we have considered so far are data agnostic
  – Fourier / Wavelet type bases for all data may not be optimal
• Improvement I: The bases we saw next were data specific
  – PCA, NMF, ICA, ...
  – The bases changed depending on the data
• Improvement II: What if bases are both data specific and task specific?
  – The basis depends on both the data and a task
Recall: Unsupervised Basis Learning
• What is a good basis?
  – Energy Compaction → Karhunen-Loève
  – Uncorrelated → PCA
  – Sparsity → Sparse Representation, Compressed Sensing, ...
  – Statistically Independent → ICA
• We create a narrative about how the data are created
Supervised Basis Learning?
• What is a good basis?
  – A basis that gives the best classification performance
  – A basis that maximizes shared information with another "view"
• We have some external information guiding our notion of an optimal basis
  – Can we learn a basis for a set of variables that will best predict some value(s)?
Regression
• Simplest case
  – Given a bunch of scalar data points, predict some value
  – Years are the independent variable
  – Temperature is the dependent variable
Regression
• Formulation of the problem
• Let's solve!
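The formulation on this slide is an equation image that did not survive extraction; the standard linear least-squares setup, which the steps on the next slides match, would be (an assumption, stated in LaTeX):

```latex
% Data X = [x_1, ..., x_N], targets Y = [y_1, ..., y_N], linear model \hat{Y} = BX
\[
  \hat{B} \;=\; \arg\min_{B} \; \lVert Y - BX \rVert_F^2
\]
```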
Regression
• Expand out the Frobenius norm
• Take the derivative
• Solve for 0
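A worked version of these three steps under the assumed formulation above (not copied from the slide's own derivation):

```latex
\[
\begin{aligned}
  \lVert Y - BX \rVert_F^2
    &= \operatorname{tr}\!\big[(Y - BX)(Y - BX)^T\big]
     = \operatorname{tr}(YY^T) - 2\operatorname{tr}(BXY^T) + \operatorname{tr}(BXX^TB^T) \\[4pt]
  \frac{\partial}{\partial B}\,\lVert Y - BX \rVert_F^2
    &= -2\,YX^T + 2\,BXX^T \;=\; 0
    \quad\Longrightarrow\quad \hat{B} = YX^T\,(XX^T)^{-1}
\end{aligned}
\]
```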
Regression
• This is basically just least squares again
• Note that this looks a lot like the following
  – In the 1-d case where x predicts y this is just ...
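The equation the slide points to is missing from the extracted text; presumably it is the solution just derived, whose 1-d specialization recalls the covariance and variance definitions from the start of the lecture (an assumption, not a copy of the slide):

```latex
\[
  \hat{B} = YX^T (XX^T)^{-1}
  \qquad\text{1-d case (centered } x, y\text{):}\qquad
  \hat{b} = \frac{\tfrac{1}{N}\sum_i x_i y_i}{\tfrac{1}{N}\sum_i x_i^2}
          = \frac{\sigma_{xy}}{\sigma_x^2}
\]
```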
Multiple Regression
• Robot Archer Example
  – Our robot fires defective arrows at a target
    • We don't know how wind might affect their movement, but we'd like to correct for it if possible
  – Predict the distance from the center of the target of a fired arrow
    • Measure wind speed in 3 directions
  – x_t = [1, w_x, w_y, w_z]^T
Multiple Regression
• Wind speed: x_t = [1, w_x, w_y, w_z]^T
• Offset from center in 2 directions: y_t = [d_x, d_y]^T
• Model: y_t = B x_t
Multiple Regression
• Answer
  – Here Y contains measurements of the distance of the arrow from the center
  – We are fitting a plane
  – Correlation is basically just the gradient
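A minimal sketch of the archer regression; the synthetic wind measurements and the "true" B below are made up for illustration and are not the slides' numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

# Wind measurements with a bias term: x_t = [1, w_x, w_y, w_z]^T
W = rng.standard_normal((3, N))
X = np.vstack([np.ones((1, N)), W])                    # 4 x N

# Made-up "true" effect of wind on the 2-d offset from the target center
B_true = np.array([[ 0.2, 1.0, -0.5,  0.1],
                   [-0.1, 0.3,  0.8, -0.6]])
Y = B_true @ X + 0.05 * rng.standard_normal((2, N))    # 2 x N offsets

# Fit the plane by least squares: B = Y X^T (X X^T)^{-1}
B_hat = Y @ X.T @ np.linalg.inv(X @ X.T)
print(np.round(B_hat - B_true, 2))                     # should be close to zero
```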
Canonical Correlation Analysis
• Further generalization (CCA)
  – Do all wind factors affect the position, or just some low-dimensional combinations û = A x?
  – Do they affect both coordinates individually, or just some combination v̂ = B y?
Canonical Correlation Analysis
• Let's call the arrow location vector Y and the wind vectors X
  – Let's find the projections of Y and X respectively that are most correlated
(Figure: the best X projection plane predicts the best Y projection)
Canonical Correlation Analysis
• What do these vectors represent?
  – The direction of maximum correlation ignores the parts of the wind and location data that do not affect each other
    • Only information about the defective arrow remains!
(Figure: the best X projection plane predicts the best Y projection)
CCA Motivation and History
• Proposed by Hotelling (1936)
• Many real world problems involve 2 "views" of data
• Economics
  – Consumption of wheat is related to the price of potatoes, rice and barley ... and wheat
  – Random vector of prices X
  – Random vector of consumption Y
CCA Motivation and History
• Magnus Borga and David Hardoon popularized CCA as a technique in signal processing and machine learning
• Better for dimensionality reduction in many cases
CCA Dimensionality Reduction
• We keep only the correlated subspace
• Is this always good?
  – If we have measured things we care about, then we have removed only useless information
CCA Dimensionality Reduction
• In this case:
  – CCA found a basis component that preserved class distinctions while reducing dimensionality
  – Able to preserve class in both views
Comparison to PCA
• PCA fails to preserve class distinctions as well
Failure of PCA
• PCA is unsupervised
  – It captures the direction of greatest variance (energy)
  – No notion of the task, and hence no notion of what is good or bad information
  – The direction of greatest variance can sometimes be noise
  – OK for reconstruction of the signal
  – Catastrophic for preserving class information in some cases
Benefits of CCA
• Why did CCA work?
  – Soft supervision
    • External knowledge
  – The 2 views track each other in a direction that does not correspond to noise
  – Noise suppression (sometimes)
• Preview
  – If one of the sets of signals is the true labels, CCA is equivalent to Linear Discriminant Analysis
    • Hard supervision
Multiview Assumption
• When does CCA work?
  – The correlated subspace must actually have interesting signal
    • If two views have correlated noise then we will learn a bad representation
  – Sometimes the correlated subspace can be noise
    • Correlated noise in both sets of views
Multiview Assumption
• Why not just concatenate both views?
  – It does not exploit the extra structure of the signal (more on this in 2 slides)
    • PCA on the joint data will decorrelate all variables
    • Not good for prediction
  – We want to decorrelate within X and within Y, but maximize the cross-correlation between X and Y
  – High dimensionality → over-fitting
Multiview Assumption
• We can sort of think of a model for how our data might be generated
(Diagram: Source → View 1, Source → View 2)
• We want View 1 independent of View 2 conditioned on knowledge of the source
  – All correlation is due to the source
Multiview Examples
• Look at many stocks from different sectors of the economy
  – Conditioned on the fact that they are part of the same economy, they might be independent of one another
• Multiple speakers saying the same sentence
  – The sentence generates signals from many speakers. Each speaker might be independent of the others conditioned on the sentence
(Diagram: Source → View 1, Source → View 2)
Multiview Examples
(Figure: multiview learning overview, http://mlg.postech.ac.kr/static/research/multiview_overview.png)
Matrix Representation
• Expressing the total error as a matrix operation:
  – E = Σ_i ||y_i − ŷ_i||²
  – Y = [y_1, y_2, ..., y_N],  Ŷ = [ŷ_1, ŷ_2, ..., ŷ_N]
  – ||A||_F² = trace(AA^T)
  – E = trace((Y − Ŷ)(Y − Ŷ)^T) = ||Y − Ŷ||_F²
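A tiny numerical check of the identity above (illustrative only; the data are random):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((2, 5))
Y_hat = rng.standard_normal((2, 5))

per_sample = sum(np.sum((Y[:, i] - Y_hat[:, i]) ** 2) for i in range(Y.shape[1]))
frobenius  = np.linalg.norm(Y - Y_hat, 'fro') ** 2
trace_form = np.trace((Y - Y_hat) @ (Y - Y_hat).T)

print(np.allclose(per_sample, frobenius), np.allclose(frobenius, trace_form))  # True True
```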
Recall: Objective Functions
• Least Squares → Regression
• What is a good basis?
  – Energy Compaction → Karhunen-Loève
  – Positive Sparse → NMF
A Quick Review
• Cross Covariance
A Quick Review
• The effect of a transform: Z = AX
  – C_XX = E[XX^T]
  – C_ZZ = E[ZZ^T] = A C_XX A^T
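A quick numerical illustration of this rule with synthetic data (not from the slides); the identity holds exactly for the sample estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = rng.standard_normal((3, N))       # data, columns are samples
A = rng.standard_normal((2, 3))       # an arbitrary linear transform

C_XX = X @ X.T / N
Z = A @ X
C_ZZ = Z @ Z.T / N

print(np.allclose(C_ZZ, A @ C_XX @ A.T))   # True
```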
Recall: Objective Functions
• So far our objectives needed no external data
  – No knowledge of the task
  – e.g.  argmin_{W,H} ||X − WH||_F²   s.t. W ∈ ℝ^{d×k}, H ∈ ℝ^{k×N}, rank(W) = k
• CCA requires an extra view
  – We force both views to look like each other:
    min_{U ∈ ℝ^{d_x×k}, V ∈ ℝ^{d_y×k}} ||U^T X − V^T Y||_F²
    s.t. U^T C_XX U = I_k,  V^T C_YY V = I_k
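The slide states the constrained objective but not a solution method. One standard route (assumed here, not taken from the slides) is to whiten each view and take an SVD of the whitened cross-covariance; a minimal NumPy sketch:

```python
import numpy as np

def inv_sqrt(C, eps=1e-10):
    """Symmetric inverse square root C^(-1/2)."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def cca(X, Y, k):
    """Return U (d_x x k), V (d_y x k) satisfying U^T C_XX U = I, V^T C_YY V = I
    and maximizing the correlation between U^T X and V^T Y. X, Y are centered (dim x N)."""
    N = X.shape[1]
    C_XX, C_YY, C_XY = X @ X.T / N, Y @ Y.T / N, X @ Y.T / N
    Wx, Wy = inv_sqrt(C_XX), inv_sqrt(C_YY)
    # SVD of the whitened cross-covariance (the "matrix correlation coefficient")
    A, s, Bt = np.linalg.svd(Wx @ C_XY @ Wy)
    U = Wx @ A[:, :k]           # canonical directions for X
    V = Wy @ Bt.T[:, :k]        # canonical directions for Y
    return U, V, s[:k]          # s[:k] are the canonical correlations

# Toy usage: two views driven by a shared 1-d source, plus per-view noise dimensions
rng = np.random.default_rng(0)
N = 2000
source = rng.standard_normal((1, N))
X = np.vstack([source, rng.standard_normal((2, N))])        # view 1
Y = np.vstack([0.8 * source, rng.standard_normal((1, N))])  # view 2
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

U, V, corr = cca(X, Y, k=1)
print(np.round(corr, 2))   # top canonical correlation should be close to 1
```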
Interpreting the CCA Objective
• Minimize the reconstruction error between the projections of both views of the data
• Find the subspaces U, V onto which we project views X and Y such that their correlation is maximized
• Find combinations of both views that best predict each other