Learning From Data Lecture 16: Similarity and Nearest Neighbor - PowerPoint PPT Presentation


  1. Learning From Data, Lecture 16: Similarity and Nearest Neighbor. Outline: Similarity; Nearest Neighbor. M. Magdon-Ismail, CSCI 4100/6100.

  2. My 5-Year-Old Called It “A ManoHorse”. The simplest method of learning that we know: classify according to similar objects you have seen.

  3. Measuring Similarity. An object is represented by its features x; the similarity between two objects x and x′ is measured by a distance d(x, x′) between their feature vectors.

  4. Measuring Similarity. The standard choice is the Euclidean distance: d(x, x′) = ‖x − x′‖.
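As a quick illustration (not from the slides), here is a minimal Python sketch of this distance on numeric feature vectors; the function name and example vectors are made up for the example:

```python
import numpy as np

def euclidean_distance(x, x_prime):
    """d(x, x') = ||x - x'||: Euclidean distance between two feature vectors."""
    x, x_prime = np.asarray(x, dtype=float), np.asarray(x_prime, dtype=float)
    return np.linalg.norm(x - x_prime)

print(euclidean_distance([0.0, 1.0], [3.0, 5.0]))  # 5.0
```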

  5. Nearest Neighbor. A test point x is classified using its nearest neighbor. Order the data points by distance to x: d(x, x[1]) ≤ d(x, x[2]) ≤ ··· ≤ d(x, x[N]). The hypothesis is g(x) = y[1](x), the label of the nearest point. [Figure: the nearest-neighbor decision regions form a Voronoi tessellation of the input space.]

  6. Nearest Neighbor. No training needed! The rule is applied directly to the stored data points; there is nothing to fit.

  7. Nearest Neighbor. E_in = 0: each training point is its own nearest neighbor, so it is classified correctly by its own label.
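A minimal Python sketch of the nearest-neighbor rule (the array names and helper are illustrative, not from the lecture):

```python
import numpy as np

def nn_classify(x, X, y):
    """1-NN rule: return y_[1](x), the label of the training point nearest to x.

    X: (N, d) array of training inputs; y: (N,) array of +/-1 labels (illustrative names).
    """
    distances = np.linalg.norm(X - x, axis=1)  # d(x, x_n) for every training point
    return y[np.argmin(distances)]             # g(x) = y_[1](x)
```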

  8. What about E_out? Theorem: E_out ≤ 2 E*_out (with high probability, as N → ∞). VC analysis: E_in is an estimate for E_out. Nearest-neighbor analysis: E_in = 0, and E_out is small. So we will never know what E_out is, but it cannot be much worse than the best anyone can do. Half the classification power of the data is in the nearest neighbor.

  9. Proving E_out ≤ 2 E*_out. Let π(x) = P[y = +1 | x] (the target in logistic regression). Assume π(x) is continuous and that x[1] → x as N → ∞; then π(x[1]) → π(x). Now,
     P[g_N(x) ≠ y] = P[y = +1, y[1] = −1] + P[y = −1, y[1] = +1]
                   = π(x)·(1 − π(x[1])) + (1 − π(x))·π(x[1])
                   → π(x)·(1 − π(x)) + (1 − π(x))·π(x)
                   = 2·π(x)·(1 − π(x))
                   ≤ 2·min{π(x), 1 − π(x)}.
     The best you can do is E*_out(x) = min{π(x), 1 − π(x)}.
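The limiting step can be sanity-checked numerically: once x[1] is essentially at x, the labels y and y[1] behave like independent ±1 coins with P[+1] = π(x), so the disagreement probability should approach 2·π(x)·(1 − π(x)). A small simulation sketch (the value of π and the sample size are arbitrary choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
pi_x, n_trials = 0.2, 1_000_000              # illustrative values
y    = rng.random(n_trials) < pi_x           # true label is +1 with probability pi(x)
y_nn = rng.random(n_trials) < pi_x           # nearest neighbor's label, drawn independently
print(np.mean(y != y_nn), 2 * pi_x * (1 - pi_x))  # both close to 0.32
```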

  10. Nearest Neighbor ‘Self-Regularizes’. [Panels: 1-NN decision boundaries for N = 2, 3, 4, 5, 6.] A simple boundary is used with few data points; a more complicated boundary is possible only when you have more data points. Regularization guides you to simpler hypotheses when data quality/quantity is lower.

  11. k-Nearest Neighbor. g(x) = sign( Σ_{i=1}^{k} y[i](x) ), where k is odd and y_n = ±1. [Panels: decision boundaries of the 1-NN rule, the 21-NN rule, and the 127-NN rule.]
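A minimal Python sketch of this rule (illustrative names, not from the lecture):

```python
import numpy as np

def knn_classify(x, X, y, k):
    """k-NN rule: g(x) = sign(sum of the k nearest labels). Assumes k odd and y in {-1, +1}."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))  # training points sorted by distance to x
    return int(np.sign(np.sum(y[order[:k]])))          # majority vote among the k nearest
```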

  12. The Role of k. k determines the tradeoff between fitting the data and overfitting the data. Theorem. As N → ∞, if k(N) → ∞ and k(N)/N → 0, then E_in(g) → E_out(g) and E_out(g) → E*_out. For example, k = ⌈√N⌉.
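A tiny sketch of the k = ⌈√N⌉ rule of thumb; nudging k to an odd value so the ±1 vote cannot tie is an added convenience, not something the slide specifies:

```python
import numpy as np

def sqrt_rule_k(N):
    """k = ceil(sqrt(N)), nudged up to an odd value so the +/-1 vote cannot tie."""
    k = int(np.ceil(np.sqrt(N)))
    return k if k % 2 == 1 else k + 1

print([sqrt_rule_k(N) for N in (100, 1000, 5000)])  # [11, 33, 71]
```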

  13. 3 Ways To Choose k.
     1. k = 3.
     2. k = ⌈√N⌉.
     3. Validation or cross-validation: the k-NN rule hypotheses g_k are constructed on the training set, tested on the validation set, and the best k is picked (see the sketch after this slide).
     [Plot: E_out (%) versus the number of data points N (up to 5000) for k = 1, k = 3, k = √N, and the cross-validated choice of k.]
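A minimal validation-based sketch of option 3, reusing the knn_classify helper sketched earlier (all names illustrative):

```python
import numpy as np

def choose_k_by_validation(X_train, y_train, X_val, y_val, candidate_ks):
    """Return the k whose k-NN rule, built on the training set, errs least on the validation set."""
    errors = []
    for k in candidate_ks:
        preds = np.array([knn_classify(x, X_train, y_train, k) for x in X_val])
        errors.append(np.mean(preds != y_val))
    return candidate_ks[int(np.argmin(errors))]
```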

  14. Nearest Neighbor is Nonparametric.
     NN rule: no parameters; expressive/flexible; g(x) needs the data; generic, can model anything.
     Linear model: (d + 1) parameters; rigid, always linear; g(x) needs only the weights; specialized.

  15. Nearest Neighbor Easily Extends to Multiclass. [Two panels: the digits data plotted by average intensity and symmetry, showing the ten digit classes.] Confusion matrix on the digits (entries in percent of test points; 41% accuracy!):
     True \ Predicted    0     1     2     3     4     5     6     7     8     9   Total
     0                 13.5   0.5   0.5   1     0     0.5   0     0     0.5   0    16.5
     1                  0.5  13.5   0     0     0     0     0     0     0     0    14
     2                  0.5   0     3.5   1     1     1.5   1     1     0     0.5  10
     3                  2.5   0     1.5   2     0.5   0.5   0.5   0.5   0.5   1     9.5
     4                  0.5   0     1     0.5   1.5   0.5   1     2     0     1.5   8.5
     5                  0.5   0     2.5   1     0.5   1.5   1     1     0     0.5   7.5
     6                  0.5   0     2     1     1     1     1     1     0     1     8.5
     7                  0     0     1.5   0.5   1.5   0.5   1     3     0     1     9
     8                  3.5   0     0.5   1     0.5   0.5   0.5   0     0.5   1     8
     9                  0.5   0     1     1     1     0.5   1     1     0.5   2     8.5
     Total             22.5  14    14     9     7.5   7     7     9.5   2     8.5  100
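For k > 1 the natural multiclass extension is a plurality vote among the k nearest labels; a minimal sketch (not the lecture's code, names illustrative):

```python
import numpy as np
from collections import Counter

def knn_multiclass(x, X, y, k):
    """Multiclass k-NN: predict the most common label among the k nearest training points."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))
    return Counter(y[order[:k]].tolist()).most_common(1)[0][0]
```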

  16. Highlights of k-Nearest Neighbor.
     1. Simple.
     2. No training.
     3. Near-optimal E_out.
     4. Easy to justify a classification to a customer.
     5. Can easily do multiclass.
     6. Can easily adapt to regression or logistic regression: for regression, g(x) = (1/k) Σ_{i=1}^{k} y[i](x); for logistic regression, g(x) = (1/k) Σ_{i=1}^{k} [[ y[i](x) = +1 ]] (see the sketch after this slide).
     Points 1-6: a good method!
     7. Computationally demanding. ← we will address this next.
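A minimal sketch of the two adaptations in point 6 (illustrative names, not from the lecture):

```python
import numpy as np

def knn_regression(x, X, y, k):
    """k-NN regression: g(x) = average of the k nearest target values."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))
    return float(np.mean(y[order[:k]]))

def knn_probability(x, X, y, k):
    """k-NN 'logistic' estimate: g(x) = fraction of the k nearest labels equal to +1."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))
    return float(np.mean(y[order[:k]] == +1))
```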
