Lecture 3: − Kernel Regression − Curse of Dimensionality Aykut Erdem February 2016 Hacettepe University
Administrative • Assignment 1 will be out on Thursday • It is due March 4 (i.e. in two weeks). • It includes − Pencil-and-paper derivations − Implementing kNN classifier − Implementing linear regression − numpy/Python code • Note: Lecture slides are not enough, you should also read related book chapters! 2
Recall from last time… Nearest Neighbors • Very simple method • Retains all training data − It can be slow at test time − Finding NNs in high dimensions is slow • Metrics are very important • Good baseline adapted from Fei-Fei Li & Andrej Karpathy & Justin Johnson 3
Classification • Input: X − Real valued, vectors over reals − Discrete values (0, 1, 2, …) − Other structures (e.g., strings, graphs, etc.) • Output: Y − Discrete (0, 1, 2, …) • Examples: X = Document, Y = Topic (Sports, Science, News); X = Cell Image, Y = Diagnosis (Anemic cell, Healthy cell) slide by Aarti Singh and Barnabas Poczos 4
Regression • Input: X − Real valued, vectors over reals − Discrete values (0, 1, 2, …) − Other structures (e.g., strings, graphs, etc.) • Output: Y − Real valued, vectors over reals • Example: Stock Market Prediction (X = Feb 01, Y = ?) slide by Aarti Singh and Barnabas Poczos 5
What should I watch tonight? slide by Sanja Fidler 6
What should I watch tonight? slide by Sanja Fidler 7
What should I watch tonight? slide by Sanja Fidler 8
Today • Kernel regression − nonparametric • Distances • Next: Linear regression − parametric − simple model 9
Simple 1-D Regression • Circles are data points (i.e., training examples) that are given to us • The data points are uniform in x, but may be displaced in y: t(x) = f(x) + ε, with ε some noise • In green is the “true” curve that we don’t know slide by Sanja Fidler 10
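A minimal sketch of how such a toy dataset can be generated, assuming a sine curve for the unknown f(x) and Gaussian noise for ε (both choices are illustrative, not taken from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_toy_data(n=10, noise_std=0.2):
    """Sample t(x) = f(x) + eps at uniformly spaced x values."""
    x = np.linspace(0.0, 1.0, n)           # uniform in x
    f = np.sin(2 * np.pi * x)              # assumed "true" curve f(x)
    t = f + rng.normal(0.0, noise_std, n)  # targets displaced by noise eps
    return x, t

x_train, t_train = make_toy_data()
```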
Kernel Regression 11
1-NN for Regression [Figure: a 1-NN fit in the (x, y) plane; within each region, the prediction is the y value of the closest training datapoint (“Here, this is the closest datapoint”).] slide by Dhruv Batra 12 Figure Credit: Carlos Guestrin
1-NN for Regression • Often bumpy (overfits) slide by Dhruv Batra 13 Figure Credit: Andrew Moore
9-NN for Regression • Often bumpy (overfits) slide by Dhruv Batra 14 Figure Credit: Andrew Moore
Weighted k-NN for Regression • Given: Training data ((x_1, y_1), …, (x_n, y_n)) − Attribute vectors: x_i ∈ X − Target attribute: y_i ∈ ℝ • Parameters: − Similarity function: K : X × X → ℝ − Number of nearest neighbors to consider: k • Prediction rule − New example x′ − k-nearest neighbors: the k training examples with largest K(x_i, x′) slide by Thorsten Joachims 15
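A minimal numpy sketch of this prediction rule, assuming a Gaussian similarity K(x_i, x′) = exp(−‖x_i − x′‖² / σ²) and predicting the similarity-weighted average of the k neighbors’ targets (the slide leaves K and the combination rule open; the function name and parameters are illustrative):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=5, sigma=1.0):
    """Predict y at x_query as a similarity-weighted average of the k
    training examples with the largest similarity K(x_i, x_query)."""
    d2 = np.sum((X_train - x_query) ** 2, axis=1)  # squared distances to query
    sims = np.exp(-d2 / sigma ** 2)                # Gaussian similarity K
    nn = np.argsort(-sims)[:k]                     # indices of k most similar examples
    w = sims[nn]
    return np.sum(w * y_train[nn]) / np.sum(w)     # weighted average of their targets
```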
Multivariate distance metrics • Suppose the input vectors x_1, x_2, …, x_N are two dimensional: x_1 = (x_11, x_12), x_2 = (x_21, x_22), …, x_N = (x_N1, x_N2). • One can draw the nearest-neighbor regions in input space. Dist(x_i, x_j) = (x_i1 − x_j1)² + (x_i2 − x_j2)² vs. Dist(x_i, x_j) = (x_i1 − x_j1)² + (3x_i2 − 3x_j2)² • The relative scalings in the distance metric affect region shapes slide by Dhruv Batra 16 Slide Credit: Carlos Guestrin
Example: Choosing a restaurant • In everyday life we need to make decisions by taking into account lots of factors • The question is what weight we put on each of these factors (how important are they with respect to the others)
Reviews (out of 5 stars) | $ | Distance | Cuisine (out of 10)
4 | 30 | 21 | 7
2 | 15 | 12 | 8
5 | 27 | 53 | 9
3 | 20 | 5 | 6
?
slide by Richard Zemel 17
Euclidean distance metric (with per-dimension scaling): D(x, x′) = sqrt( Σ_i σ_i² (x_i − x′_i)² ). Or equivalently, D(x, x′) = sqrt( (x − x′)ᵀ A (x − x′) ), where A is the diagonal matrix diag(σ_1², …, σ_N²). slide by Dhruv Batra 18 Slide Credit: Carlos Guestrin
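A small numpy sketch checking the equivalence stated on this slide: the per-dimension scaled distance equals the matrix form with a diagonal A (the feature vectors and weights below are made up for illustration):

```python
import numpy as np

def scaled_euclidean(x, xp, sigmas):
    """Weighted Euclidean distance: sqrt(sum_i sigma_i^2 (x_i - x'_i)^2)."""
    return np.sqrt(np.sum((sigmas ** 2) * (x - xp) ** 2))

def matrix_form(x, xp, A):
    """Equivalent form sqrt((x - x')^T A (x - x')) with A = diag(sigma_i^2)."""
    d = x - xp
    return np.sqrt(d @ A @ d)

x  = np.array([4.0, 30.0, 21.0, 7.0])     # e.g. one restaurant's features
xp = np.array([2.0, 15.0, 12.0, 8.0])     # another restaurant's features
sigmas = np.array([1.0, 0.1, 0.05, 2.0])  # hand-picked per-feature scalings
A = np.diag(sigmas ** 2)

assert np.isclose(scaled_euclidean(x, xp, sigmas), matrix_form(x, xp, A))
```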
Notable distance metrics (and their level sets) • Scaled Euclidean (L2) • Mahalanobis (non-diagonal A) slide by Dhruv Batra 19 Slide Credit: Carlos Guestrin
Minkowski distance D = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p} slide by Dhruv Batra Image Credit: By Waldir (Based on File:MinkowskiCircles.svg) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons 20
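A one-function sketch of the Minkowski distance; p = 1 gives the L1 norm, p = 2 the Euclidean distance, and large p approaches the L∞ (max) norm:

```python
import numpy as np

def minkowski(x, y, p):
    """D = (sum_i |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(minkowski(x, y, 1))    # 3.0   (L1)
print(minkowski(x, y, 2))    # ~2.24 (Euclidean)
print(minkowski(x, y, 100))  # ~2.0  (approaches the max norm)
```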
Notable distance metrics (and their level sets) • Scaled Euclidean (L2) • L1 norm (absolute) • L∞ (max) norm slide by Dhruv Batra 21 Slide Credit: Carlos Guestrin
Kernel Regression/Classification Four things make a memory-based learner: • A distance metric − Euclidean (and others) • How many nearby neighbors to look at? − All of them • A weighting function (optional) − w_i = exp(−d(x_i, query)² / σ²) − Points near the query are weighted strongly, far points weakly. The σ parameter is the kernel width. Very important. • How to fit with the local points? − Predict the weighted average of the outputs: prediction = Σ_i w_i y_i / Σ_i w_i slide by Dhruv Batra 22 Slide Credit: Carlos Guestrin
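Combining the four ingredients above, a minimal sketch of the resulting predictor (a Nadaraya-Watson style weighted average over all training points; the function name and default σ are illustrative):

```python
import numpy as np

def kernel_regression(X_train, y_train, x_query, sigma=1.0):
    """Predict the weighted average of all training outputs, with
    weights w_i = exp(-d(x_i, query)^2 / sigma^2)."""
    d2 = np.sum((X_train - x_query) ** 2, axis=1)  # squared Euclidean distances
    w = np.exp(-d2 / sigma ** 2)                   # Gaussian weights
    return np.sum(w * y_train) / np.sum(w)         # weighted average of outputs
```

Unlike k-NN, every training point contributes; σ controls how quickly the influence of far-away points decays.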
Weighting/Kernel functions w_i = exp(−d(x_i, query)² / σ²) (Our examples use Gaussian) slide by Dhruv Batra 23 Slide Credit: Carlos Guestrin
Effect of Kernel Width • What happens as σ → ∞? • What happens as σ → 0? slide by Dhruv Batra Image Credit: Ben Taskar 24
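A quick numerical check of the two limiting cases, using the same Gaussian weighting on a made-up 1-D dataset: as σ grows, all weights become equal and the prediction tends to the global mean of the outputs; as σ shrinks, only the nearest training point matters and the predictor behaves like 1-NN.

```python
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
xq = np.array([0.9])                       # query point, closest to x = 1

def predict(sigma):
    w = np.exp(-np.sum((X - xq) ** 2, axis=1) / sigma ** 2)
    return np.sum(w * y) / np.sum(w)

print(predict(1e3))   # sigma -> inf: weights ~equal, prediction -> mean(y) = 3.5
print(predict(1e-2))  # sigma -> 0: only x = 1 matters, prediction -> 1.0 (like 1-NN)
```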
Problems with Instance-Based Learning • Expensive − No learning: most of the real work is done at test time − For every test sample, we must search through the whole dataset, which is very slow! − Must use tricks like approximate nearest-neighbor search • Doesn’t work well with a large number of irrelevant features − Distances get overwhelmed by noisy features • Curse of dimensionality − Distances become meaningless in high dimensions slide by Dhruv Batra 25
Curse of Dimensionality • Consider applying a KNN classifier/regressor to data where the inputs are uniformly distributed in the D-dimensional unit cube. • Suppose we estimate the density of class labels around a test point x by “growing” a hyper-cube around x until it contains a desired fraction f of the data points. • The expected edge length of this cube will be e_D(f) = f^(1/D). • If D = 10 and we want to base our estimate on 10% of the data, we have e_10(0.1) = 0.8, so we need to extend the cube 80% along each dimension around x. • Even if we only use 1% of the data, we find e_10(0.01) = 0.63, so this is no longer very local. [Figure: edge length of the cube vs. fraction of data in the neighborhood, for d = 1, 3, 5, 7, 10.] slide by Kevin Murphy 26
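The edge-length formula can be evaluated directly; a tiny sketch reproducing the numbers on the slide:

```python
def edge_length(f, D):
    """Expected edge length e_D(f) = f**(1/D) of a cube capturing a fraction f
    of uniformly distributed points in the D-dimensional unit cube."""
    return f ** (1.0 / D)

print(edge_length(0.10, 10))  # ~0.79: need ~80% of each axis to capture 10% of the data
print(edge_length(0.01, 10))  # ~0.63: even 1% of the data is not local in 10 dimensions
```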
Next Lecture: Linear Regression 27