
Kernel Methods For Regression and Classification - Mike Hughes - PowerPoint PPT Presentation



  1. Tufts COMP 135: Introduction to Machine Learning
     https://www.cs.tufts.edu/comp/135/2020f/
     Summary of Unit 5: Kernel Methods For Regression and Classification
     Mike Hughes - Tufts COMP 135 - Fall 2020

  2. SVM vs. Logistic Regression
     • Loss: SVM uses the hinge loss; logistic regression uses cross entropy (log loss).
     • Sensitivity to outliers: SVM is less sensitive; logistic regression is more sensitive.
     • Probabilistic? SVM: no. Logistic regression: yes.
     • Multi-class? SVM: only via a separate model for each class (one-vs-all). Logistic regression: easy, using softmax.
     • Kernelizable? SVM: yes, with speed benefits from sparsity. Logistic regression: yes (covered next class).
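
To make the "Loss" row above concrete, here is a minimal numpy sketch (my own toy scores, not from the slides) evaluating the hinge loss used by SVMs and the cross-entropy (log) loss used by logistic regression on the same classifier scores:

```python
import numpy as np

def hinge_loss(y_pm1, score):
    """Hinge loss used by SVMs; labels are +1 / -1, score = f(x)."""
    return np.maximum(0.0, 1.0 - y_pm1 * score)

def log_loss(y01, score):
    """Cross entropy (log loss) used by logistic regression; labels are 0 / 1."""
    proba = 1.0 / (1.0 + np.exp(-score))        # sigmoid maps score to probability
    return -(y01 * np.log(proba) + (1 - y01) * np.log(1 - proba))

scores = np.array([-2.0, -0.5, 0.5, 2.0])       # toy classifier scores f(x)
print(hinge_loss(+1, scores))                   # hinge: exactly zero once margin >= 1
print(log_loss(1, scores))                      # log loss: decreases but never reaches zero
```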

  3. Multi-class SVMs
     • How do we extend the idea of margin to more than 2 classes? Not so elegant. Two options:
     • One-vs-rest: fit C separate models; pick the class with the largest f(x).
     • One-vs-one: fit C(C-1)/2 models; pick the class with the most f(x) "wins".
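
A hedged illustration of the model counts above, using scikit-learn's one-vs-rest and one-vs-one wrappers around a linear SVM (the dataset and estimator settings are my choices, not the course's):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                       # 3 classes, so C = 3

ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)

print(len(ovr.estimators_))   # C = 3 fitted models (one-vs-rest)
print(len(ovo.estimators_))   # C*(C-1)/2 = 3 fitted models (one-vs-one)
```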

  4. Multi-class Logistic Regression
     • How do we extend LR to more than 2 classes?
     • Elegant: we can train weights using the same prediction function we'll use at test time:
       p(x) = softmax(w_1^T x, w_2^T x, ..., w_C^T x)
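
A minimal sketch of the softmax prediction function on the slide, assuming W stacks the per-class weight vectors as rows (the shapes and toy values are mine):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax along the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    expd = np.exp(shifted)
    return expd / expd.sum(axis=-1, keepdims=True)

def predict_proba(W, x):
    """Class probabilities softmax(w_1^T x, ..., w_C^T x) for one example x."""
    return softmax(W @ x)            # W has shape (C, F), x has shape (F,)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))          # C = 3 classes, F = 4 features (toy values)
x = rng.normal(size=4)
print(predict_proba(W, x))           # one probability per class, sums to 1
```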

  5. Kernel methods
     • Use kernel functions (similarity functions with special properties) to obtain flexible high-dimensional feature transformations without explicit features.
     • Solve the "dual" problem (for parameters alpha), not the "primal" problem (for weights w).
     • Can use the "kernel trick" for: regression, and classification (logistic regression or SVM).
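
One hedged example of the "dual" view: kernel ridge regression with an RBF kernel, where we solve for dual parameters alpha and predict using only kernel evaluations, never explicit high-dimensional features. The kernel choice, regularization strength, and toy data are my own, not the slides':

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix: K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(40, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=40)

# Dual solution: alpha = (K + lam*I)^-1 y, prediction f(x) = sum_i alpha_i k(x_i, x)
lam = 0.1
K = rbf_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf_kernel(X_test, X_train) @ alpha     # predictions need only kernel values
print(y_pred)
```

Note how the test-time prediction depends on the training data only through kernel evaluations, which is exactly what makes the trick work without explicit features.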

  6. Kernel Methods for Regression
     Kernels exist for:
     • Periodic regression
     • Histograms
     • Strings
     • Graphs
     • And more!
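
As one concrete instance from the list above, here is a sketch of a commonly used periodic kernel (the exp-sine-squared form); the exact formula and hyperparameters are my assumption, not taken from the course:

```python
import numpy as np

def periodic_kernel(x1, x2, period=1.0, length_scale=1.0):
    """One common periodic kernel: similarity repeats every `period` units."""
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x1 - x2) / period) ** 2 / length_scale ** 2)

print(periodic_kernel(0.0, 1.0))   # ~1.0: points one full period apart look identical
print(periodic_kernel(0.0, 0.5))   # smaller: half a period apart is maximally dissimilar
```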

  7. Review: Key concepts in supervised learning
     • Parametric vs. nonparametric methods
     • Bias vs. variance

  8. Parametric vs. Nonparametric
     • Parametric methods: the complexity of the decision function is fixed in advance and specified by a finite, fixed number of parameters, regardless of training data size. Examples: linear regression, logistic regression, neural networks.
     • Nonparametric methods: the complexity of the decision function can grow as more training data is observed. Examples: nearest neighbor methods, decision trees, ensembles of trees.

  9. Bias & Variance
     • Estimate (a random variable): ŷ
     • Known "true" response: y
     Credit: Scott Fortmann-Roe, http://scott.fortmann-roe.com/docs/BiasVariance.html

  10. Decompose into Bias & Variance
      • y is the known "true" response value at a given known heldout input x.
      • ŷ is a random variable obtained by fitting the estimator to a random sample of N training data examples, then predicting at x. Let ȳ = E[ŷ] denote the average prediction.
      • Bias: error from the average model to the truth, (ȳ - y)^2. How far the average prediction of our model (averaged over all possible training sets of size N) is from the true response.
      • Variance: deviation over model samples, Var(ŷ) = E[(ŷ - ȳ)^2] = E[ŷ^2] - ȳ^2. How far predictions based on a single training set are from the average prediction.
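
A hedged Monte Carlo sketch of these definitions (the toy true function, noise level, and model are mine): repeatedly redraw a size-N training set, refit, and predict at one heldout input x to estimate bias^2 and variance empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)

x0, N, n_repeats = 1.5, 20, 2000
y_true = true_f(x0)

preds = np.empty(n_repeats)
for r in range(n_repeats):
    x_tr = rng.uniform(0, np.pi, size=N)
    y_tr = true_f(x_tr) + 0.2 * rng.normal(size=N)
    coefs = np.polyfit(x_tr, y_tr, deg=1)       # refit a simple (high-bias) model
    preds[r] = np.polyval(coefs, x0)            # predict at the heldout x0

y_bar = preds.mean()
bias_sq = (y_bar - y_true) ** 2                 # (ȳ - y)^2
variance = preds.var()                          # E[(ŷ - ȳ)^2]
total = ((preds - y_true) ** 2).mean()          # E[(ŷ - y)^2]
print(bias_sq, variance, total)                 # total ≈ bias^2 + variance
```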

  11. Total Error: Bias^2 + Variance
      E[(ŷ(x_tr, y_tr) - y)^2]          (expected value is over samples of the observed training set)
        = E[ŷ^2 - 2 ŷ y + y^2]
        = E[ŷ^2] - 2 ȳ y + y^2
        = E[ŷ^2] - ȳ^2 + ȳ^2 - 2 ȳ y + y^2
        = Var(ŷ) + (ȳ - y)^2
        = Variance + Bias^2

  12. Toy example: ISL Fig. 6.5
      • Bias: error due to the inability of the typical fit (averaged over training sets) to capture the true predictive relationship.
      • Variance: error due to estimating from a single finite-size training set.
      • Total error trades these off: more flexible models risk overfitting, less flexible models risk underfitting.
      All supervised learning methods must manage the bias/variance tradeoff. Hyperparameter search is key.

  13. Tufts COMP 135: Introduction to Machine Learning
      https://www.cs.tufts.edu/comp/135/2020f/
      Dimensionality Reduction & Embedding
      Prof. Mike Hughes
      Many ideas/slides attributable to: Liping Liu (Tufts), Emily Fox (UW), Matt Gormley (CMU)

  14. What will we learn?
      [Course overview diagram: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Unsupervised learning takes data examples {x_n}, n = 1 ... N, a task, and a performance measure, and produces a summary of the data x.]

  15. Task: Embedding
      [Diagram: embedding is an unsupervised learning task; a 2D plot with axes x_1 and x_2 illustrates the embedded data.]

  16. Dim. Reduction / Embedding Unit Objectives
      • Goals of dimensionality reduction:
        - Reduce feature vector size (keep signal, discard noise)
        - "Interpret" features: visualize / explore / understand
      • Common approaches:
        - Principal Component Analysis (PCA)
        - word2vec and other neural embeddings
      • Evaluation metrics:
        - Storage size
        - Reconstruction error
        - "Interpretability"

  17. Example: 2D viz. of movies

  18. Example: Genes vs. Geography (Nature, 2008)
      "Where possible, we based the geographic origin on the observed country data for grandparents. We used a 'strict consensus' approach: if all observed grandparents originated from a single country, we used that country as the origin. If an individual's observed grandparents originated from different countries, we excluded the individual. Where grandparental data were unavailable, we used the individual's country of birth."
      Total sample size after exclusion: 1,387 subjects.
      Features: over half a million variable DNA sites in the human genome.

  19. Example: Genes vs. Geography (Nature, 2008)

  20. Example: Eigen Clothing

  21. (image-only slide; no extracted text)

  22. Centering the Data
      Goal: each feature's mean = 0.0

  23. Why center?
      • Think of the mean vector as the simplest possible "reconstruction" of a dataset.
      • No example-specific parameters, just one F-dim vector:
        min over m in R^F of  sum_{n=1}^N (x_n - m)^T (x_n - m),   with solution m* = mean(x_1, ..., x_N)
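
A small numpy sketch of this point (toy data mine): the per-feature mean minimizes the summed squared reconstruction error, and subtracting it centers every feature at zero:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -2.0, 0.5], scale=1.0, size=(100, 3))   # N=100 examples, F=3 features

m = X.mean(axis=0)          # m* = mean(x_1, ..., x_N), one F-dim vector
X_centered = X - m          # after centering, each feature has mean ~0.0

print(np.round(X_centered.mean(axis=0), 8))   # ≈ [0, 0, 0]
```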

  24. Mean reconstruction
      [Figure: "original" examples alongside their "reconstructed" versions using only the mean vector.]

  25. Principal Component Analysis

  26. Linear Projection to 1D

  27. Reconstruction from 1D to 2D

  28. 2D Orthogonal Basis

  29. Which 1D projection is best?
      Idea: minimize reconstruction error.

  30. K-dim Reconstruction with PCA
      x_i ≈ W z_i + m
      where x_i is the high-dim data (F-vector), W is the weight matrix (F x K), z_i is the low-dim vector (K-vector), and m is the "mean" (F-vector).
      Problem: over-parameterized, so there are too many possible solutions! If we scale z_i by 2 and scale W by 1/2, we get an equivalent reconstruction.
      We need to constrain the magnitude of the weights. Let's make all the weight vectors unit vectors: ||w_k||_2 = 1 for each component k.
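
A quick numpy check of the scale ambiguity described above (toy shapes and values are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
F, K = 5, 2
W = rng.normal(size=(F, K))
z = rng.normal(size=K)
m = rng.normal(size=F)

x_hat_1 = W @ z + m
x_hat_2 = (W / 2.0) @ (2.0 * z) + m        # rescaled pair gives the same reconstruction
print(np.allclose(x_hat_1, x_hat_2))       # True: hence the unit-norm constraint on W
```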

  31. Principal Component Analysis
      Training step: .fit()
      • Input:
        - X: training data, N x F (N high-dim. example vectors)
        - K: int, number of components, satisfying 1 <= K <= F
      • Output: trained parameters for PCA
        - m: mean vector, size F
        - W: learned basis of weight vectors, F x K
          - One F-dim. vector (magnitude 1) for each component
          - Each of the K vectors is orthogonal to every other
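
A minimal sketch of what .fit() could compute, using an SVD of the centered data; this is my own illustration under the slide's conventions, not the course's reference implementation:

```python
import numpy as np

def pca_fit(X, K):
    """Return (m, W): m is the F-dim mean, W an F x K orthonormal basis from the SVD."""
    m = X.mean(axis=0)                           # mean vector, size F
    U, S, Vt = np.linalg.svd(X - m, full_matrices=False)
    W = Vt[:K].T                                 # F x K; rows of Vt are unit-norm and orthogonal
    return m, W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                    # N=100 examples, F=6 features (toy data)
m, W = pca_fit(X, K=2)
print(W.shape)                                   # (6, 2)
print(np.round(W.T @ W, 8))                      # ≈ identity: magnitude-1, mutually orthogonal
```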

  32. Principal Component Analysis
      Transformation step: .transform()
      • Input:
        - X: training data, N x F (N high-dim. example vectors)
        - Trained PCA "model":
          - m: mean vector, size F
          - W: learned basis of eigenvectors, F x K
            - One F-dim. vector (magnitude 1) for each component
            - Each of the K vectors is orthogonal to every other
      • Output:
        - Z: projected data, N x K
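
And a matching sketch of .transform(), reusing the same (m, W) recipe as above and also showing the reconstruction back to F dimensions (again my own illustration, with toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# fit: same SVD recipe as the sketch above, with K = 2
m = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - m, full_matrices=False)
W = Vt[:2].T                                  # F x K

Z = (X - m) @ W                               # .transform(): projected data, N x K
X_hat = Z @ W.T + m                           # reconstruction in the original F-dim space
print(Z.shape, X_hat.shape)                   # (100, 2) (100, 6)
print(np.mean((X - X_hat) ** 2))              # reconstruction error (shrinks as K grows)
```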

  33. Example: EigenFaces
