Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2020f/

Summary of Unit 5: Kernel Methods for Regression and Classification
Mike Hughes - Tufts COMP 135 - Fall 2020
                        SVM                                       Logistic Regression
Loss                    hinge                                     cross entropy (log loss)
Sensitive to outliers   Less sensitive                            More sensitive
Probabilistic?          No                                        Yes
Multi-class?            Only via separate model for each class    Easy, using softmax
                        (one-vs-all)
Kernelizable?           Yes, with speed benefits from sparsity    Yes (cover next class)
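To make the "Loss" row concrete, here is a small NumPy sketch (an added illustration, not from the slides) comparing the per-example hinge loss and the logistic (cross-entropy) loss as a function of the margin score y * f(x):

```python
import numpy as np

def hinge_loss(margin):
    # SVM: zero loss once the margin reaches 1, linear penalty otherwise
    return np.maximum(0.0, 1.0 - margin)

def log_loss(margin):
    # Logistic regression: log(1 + exp(-margin)), never exactly zero
    return np.log1p(np.exp(-margin))

margins = np.linspace(-2, 3, 6)
for m, h, l in zip(margins, hinge_loss(margins), log_loss(margins)):
    print(f"margin={m:+.1f}  hinge={h:.3f}  log={l:.3f}")
```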
Multi-class SVMs
• How do we extend the idea of margin to more than 2 classes? Not so elegant. Two options:
• One vs rest: need to fit C separate models; pick the class with the largest f(x)
• One vs one: need to fit C(C-1)/2 models; pick the class with the most f(x) “wins”
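As one hedged illustration of these two options (my own sketch using scikit-learn on a hypothetical 3-class toy dataset, not code from the course):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC, SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = make_blobs(n_samples=300, centers=3, random_state=0)  # toy 3-class data

# One-vs-rest: C binary SVMs; predict the class with the largest score f(x)
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)

# One-vs-one: C(C-1)/2 pairwise SVMs; predict the class with the most "wins"
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)

# Kernelized SVC handles multi-class via one-vs-one internally by default
rbf_svm = SVC(kernel="rbf").fit(X, y)

print(ovr.predict(X[:5]), ovo.predict(X[:5]), rbf_svm.predict(X[:5]))
```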
Multi-class Logistic Regression
• How do we extend LR to more than 2 classes?
• Elegant: can train weights using the same prediction function we’ll use at test time

p(x) = softmax(w_1^T x, w_2^T x, ..., w_C^T x)
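A minimal NumPy sketch of this prediction function (an added illustration; the weight matrix W and input x below are made up, with one row w_c per class):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; outputs sum to 1
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def predict_proba(W, x):
    # One score w_c^T x per class, mapped to a probability vector
    return softmax(W @ x)

W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])  # C=3 classes, F=2 features
x = np.array([0.5, 2.0])
print(predict_proba(W, x))  # 3 probabilities summing to 1
```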
Kernel methods
• Use kernel functions (similarity functions with special properties) to obtain flexible high-dimensional feature transformations without explicit features
• Solve the “dual” problem (for parameters alpha), not the “primal” problem (for weights w)
• Can use the “kernel trick” for:
  • regression
  • classification (Logistic Regr. or SVM)
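As a sketch of what "solve the dual problem" can look like in practice (a simplified illustration, not the course's reference implementation): kernel ridge regression never builds explicit features, solves for one alpha coefficient per training example, and predicts using kernel evaluations against the training set.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def fit_dual(X_train, y_train, lam=0.1, gamma=1.0):
    # Dual solution: alpha = (K + lam*I)^{-1} y, one coefficient per example
    K = rbf_kernel(X_train, X_train, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

def predict_dual(X_test, X_train, alpha, gamma=1.0):
    # Prediction is a kernel-weighted sum over training examples
    return rbf_kernel(X_test, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = fit_dual(X, y)
print(predict_dual(np.array([[0.0], [1.5]]), X, alpha))
```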
Kernel Methods for Regression
Kernels exist for:
• Periodic regression
• Histograms
• Strings
• Graphs
• And more!
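For instance, a periodic kernel can be dropped into kernel ridge regression. This sketch is my own illustration with made-up hyperparameters; it assumes scikit-learn's ExpSineSquared kernel object, which KernelRidge accepts in place of a built-in kernel name:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process.kernels import ExpSineSquared

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(80, 1))
y = np.sin(2 * np.pi * X[:, 0] / 3.0) + 0.1 * rng.normal(size=80)  # period ~3

# ExpSineSquared is a periodic similarity; KernelRidge solves the dual problem
model = KernelRidge(kernel=ExpSineSquared(length_scale=1.0, periodicity=3.0), alpha=0.1)
model.fit(X, y)
print(model.predict(np.array([[0.0], [3.0], [6.0]])))  # inputs one period apart get similar predictions
```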
Review: Key concepts in supervised learning
• Parametric vs nonparametric methods
• Bias vs variance
Parametric vs Nonparametric
• Parametric methods
  • Complexity of decision function fixed in advance and specified by a finite fixed number of parameters, regardless of training data size
  • Examples: Linear regression, Neural networks, Logistic regression
• Nonparametric methods
  • Complexity of decision function can grow as more training data is observed
  • Examples: Nearest neighbor methods, Decision trees, Ensembles of trees
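A small illustration of the distinction (added example, not from the slides): a linear model's parameter count is fixed by the number of features, while k-nearest-neighbors keeps every training example around.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
for n in (100, 10_000):
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    # Parametric: F weights + 1 intercept, no matter how much training data we see
    lin = LinearRegression().fit(X, y)
    n_params = lin.coef_.size + 1

    # Nonparametric: the fitted "model" stores all n training examples
    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
    n_stored = knn.n_samples_fit_

    print(f"n={n}: linear model params={n_params}, k-NN stored examples={n_stored}")
```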
Bias & Variance
• ŷ : estimate (a random variable)
• y : known “true” response
Credit: Scott Fortmann-Roe, http://scott.fortmann-roe.com/docs/BiasVariance.html
<latexit sha1_base64="h1ZEA4W0jGTPZAVN/oGQAtWEzI=">ACInicbVDLSgNBEJz1bXxFPXoZDIeDLtRUA9CUASPEUwUsmvonUzM4OyDmV4xLPstXvwVLx4U9ST4Mc7GCJpYMFBUVTPd5cdSaLTtD2tsfGJyanpmtjA3v7C4VFxeaegoUYzXWSQjdemD5lKEvI4CJb+MFYfAl/zCvznO/YtbrSIwnPsxdwL4DoUHcEAjdQqHrjI7zBtgMo23S5g2su26CF1A8Cu76cnWZP+6HSbuj6oPHFVoV6rWLdh90lDgDUiID1FrFN7cdsSTgITIJWjcdO0YvBYWCSZ4V3ETzGNgNXPOmoSEXHtp/8SMbhilTuRMi9E2ld/T6QaN0LfJPMV9fDXi7+5zUT7Ox7qQjBHnIvj/qJiRPO+aFsozlD2DAGmhNmVsi4oYGhaLZgSnOGTR0mjUnZ2ypWz3VL1aFDHDFkj62STOGSPVMkpqZE6YeSePJn8mI9WE/Wq/X+HR2zBjOr5A+szy8grqNc</latexit> Decompose into Bias & Variance is known “true” response value at given known heldout input x y is a Random Variable obtained by fitting estimator to random ˆ y sample of N training data examples, then predicting at x Bias : Error from average model to true y − y ) 2 (¯ y , E [ˆ ¯ y ] How far the average prediction of our model (averaged over all possible training sets of size N) is from true response Variance : h y 2 i y 2 y ) 2 ] = E Deviation over model samples Var(ˆ y ) = E [(ˆ y − ¯ ˆ − ¯ How far predictions based on a single training set are from the average prediction Mike Hughes - Tufts COMP 135 - Spring 2019 11
Total Error: Bias^2 + Variance
(Expected values are over samples of the observed training set.)

E[(ŷ(x_tr, y_tr) − y)^2] = E[(ŷ − y)^2]
  = E[ŷ^2 − 2ŷy + y^2]
  = E[ŷ^2] − 2ȳy + y^2
  = E[ŷ^2] − ȳ^2 + ȳ^2 − 2ȳy + y^2
  = Var(ŷ) + (ȳ − y)^2
  = Variance + Bias^2
Toy example: ISL Fig. 6.5
[Figure: curves for total error, bias, and variance as model flexibility varies, from more flexible (overfitting) to less flexible (underfitting).]
• Bias: error due to inability of the typical fit (averaged over training sets) to capture the true predictive relationship
• Variance: error due to estimating from a single finite-size training set

All supervised learning methods must manage the bias/variance tradeoff. Hyperparameter search is key.
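As a hedged sketch of hyperparameter search (my own example, not the course's): cross-validation can pick the flexibility of a k-NN regressor, where small n_neighbors is flexible (high variance) and large n_neighbors is rigid (high bias).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=200)

# Small n_neighbors -> flexible fit (risk of overfitting / high variance);
# large n_neighbors -> rigid fit (risk of underfitting / high bias).
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [1, 3, 5, 10, 25, 50, 100]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print("best n_neighbors:", search.best_params_["n_neighbors"])
```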
Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2020f/

Dimensionality Reduction & Embedding
Prof. Mike Hughes
Many ideas/slides attributable to: Liping Liu (Tufts), Emily Fox (UW), Matt Gormley (CMU)
What will we learn?
[Diagram: Supervised Learning / Unsupervised Learning / Reinforcement Learning. This unit sits in unsupervised learning: data examples {x_n}_{n=1}^N feed a task that produces a summary of the data x, judged by a performance measure.]
Task: Embedding
[Diagram: embedding shown as an unsupervised learning task; illustration of 2D data with axes x_1 and x_2.]
Dim. Reduction/Embedding Unit Objectives
• Goals of dimensionality reduction
  • Reduce feature vector size (keep signal, discard noise)
  • “Interpret” features: visualize/explore/understand
• Common approaches
  • Principal Component Analysis (PCA)
  • word2vec and other neural embeddings
• Evaluation metrics
  • Storage size
  • Reconstruction error
  • “Interpretability”
Example: 2D viz. of movies
Example: Genes vs. geography (Nature, 2008)
Where possible, we based the geographic origin on the observed country data for grandparents. We used a ‘strict consensus’ approach: if all observed grandparents originated from a single country, we used that country as the origin. If an individual’s observed grandparents originated from different countries, we excluded the individual. Where grandparental data were unavailable, we used the individual’s country of birth.
Total sample size after exclusion: 1,387 subjects
Features: over half a million variable DNA sites in the human genome
Example: Genes vs. geography (Nature, 2008)
Example: Eigen Clothing
Centering the Data
Goal: each feature’s mean = 0.0
Why center?
• Think of the mean vector as the simplest possible “reconstruction” of a dataset
• No example-specific parameters, just one F-dim vector

min_{m ∈ R^F} Σ_{n=1}^N (x_n − m)^T (x_n − m)

m* = mean(x_1, ..., x_N)
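A quick NumPy check of this claim (added illustration): the per-feature mean minimizes the summed squared reconstruction error among constant vectors, and subtracting it leaves every feature with mean ~0.0.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -2.0, 0.3], scale=[1.0, 3.0, 0.5], size=(100, 3))  # N=100, F=3

m = X.mean(axis=0)                  # the optimal constant "reconstruction"
X_centered = X - m                  # each feature now has mean ~0.0

def sum_sq_error(X, vec):
    # Sum over examples of (x_n - vec)^T (x_n - vec)
    return ((X - vec) ** 2).sum()

print(np.round(X_centered.mean(axis=0), 6))               # ~[0, 0, 0]
print(sum_sq_error(X, m) <= sum_sq_error(X, m + 0.1))      # True: the mean does at least as well
```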
Mean reconstruction
[Figure: original examples vs. reconstructions using the mean alone.]
Principal Component Analysis
Linear Projection to 1D
Reconstruction from 1D to 2D
2D Orthogonal Basis
Which 1D projection is best?
Idea: Minimize reconstruction error
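To make "minimize reconstruction error" concrete, this added sketch projects centered 2D data onto two candidate unit directions and compares squared reconstruction errors; the better 1D projection is the one with the smaller error.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data, centered
X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 2.0]], size=200)
X = X - X.mean(axis=0)

def reconstruction_error(X, w):
    w = w / np.linalg.norm(w)        # unit-length direction
    z = X @ w                        # 1D projection, shape (N,)
    X_hat = np.outer(z, w)           # reconstruction back in 2D
    return ((X - X_hat) ** 2).sum()

for w in (np.array([1.0, 1.0]), np.array([1.0, -1.0])):
    print(w, reconstruction_error(X, w))
# The direction aligned with the data's main axis of variation gives lower error.
```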
K-dim Reconstruction with PCA

x_i ≈ W z_i + m
• x_i : high-dim data (F vector)
• W : weights (F x K)
• z_i : low-dim vector (K vector)
• m : “mean” (F vector)

Problem: over-parameterized, too many possible solutions! If we scale z by 2, we can scale W by 1/2 and get an equivalent reconstruction. We need to constrain the magnitude of the weights: let’s make all the weight vectors unit vectors, ||w_k||_2 = 1 for each of the K columns.
Principal Component Analysis — Training step: .fit()
• Input:
  • X : training data, N x F (N high-dim. example vectors)
  • K : int, number of components; satisfies 1 <= K <= F
• Output: trained parameters for PCA
  • m : mean vector, size F
  • W : learned basis of weight vectors, F x K
    • One F-dim. vector (magnitude 1) for each component
    • Each of the K vectors is orthogonal to every other
Principal Component Analysis — Transformation step: .transform()
• Input:
  • X : training data, N x F (N high-dim. example vectors)
  • Trained PCA “model”
    • m : mean vector, size F
    • W : learned basis of eigenvectors, F x K
      • One F-dim. vector (magnitude 1) for each component
      • Each of the K vectors is orthogonal to every other
• Output:
  • Z : projected data, N x K
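A minimal sketch of the .fit() / .transform() interface described above, using NumPy's SVD (my own illustration, not the course's reference code); scikit-learn's PCA class follows the same fit/transform pattern.

```python
import numpy as np

def pca_fit(X, K):
    """Return (m, W): mean vector of size F and an orthonormal basis of shape F x K."""
    m = X.mean(axis=0)
    # Right singular vectors of the centered data give the principal directions
    _, _, Vt = np.linalg.svd(X - m, full_matrices=False)
    W = Vt[:K].T                      # F x K; columns are unit-length and mutually orthogonal
    return m, W

def pca_transform(X, m, W):
    """Project N x F data onto the learned K-dim basis: returns Z, shape N x K."""
    return (X - m) @ W

def pca_reconstruct(Z, m, W):
    """Map N x K codes back to N x F approximate reconstructions."""
    return Z @ W.T + m

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated toy data, N=100, F=5
m, W = pca_fit(X, K=2)
Z = pca_transform(X, m, W)
print(Z.shape, np.round(W.T @ W, 6))                      # (100, 2) and ~identity (orthonormal columns)
print(np.mean((X - pca_reconstruct(Z, m, W)) ** 2))       # mean squared reconstruction error
```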
Example: EigenFaces