0. Instance-Based Learning
Based on "Machine Learning", T. Mitchell, McGraw-Hill, 1997, ch. 8.
Acknowledgement: the present slides are an adaptation of slides drawn by T. Mitchell.
1. Key Ideas
Training: simply store all the training examples.
Classification: compute the target function only locally, for each query.
Advantage: it can prove useful for very complex target functions.
Disadvantages: 1. classification can be computationally costly; 2. usually all attributes are considered, even irrelevant ones.
2. Methods
1. k-Nearest Neighbor
2. Locally weighted regression: a generalization of k-NN
3. Radial basis functions: combining instance-based learning and neural networks
4. Case-based reasoning: symbolic representations and knowledge-based inference
3. 1. k-Nearest Neighbor Learning
Given a query instance x_q, estimate $\hat{f}(x_q)$:
• in case of discrete-valued $f : \Re^n \rightarrow V$, take a vote among its k nearest neighbors:
  $\hat{f}(x_q) \leftarrow \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$
  where $\delta(a, b) = 1$ if $a = b$, and $\delta(a, b) = 0$ if $a \neq b$
• in case of continuous-valued f, take the mean of the f values of its k nearest neighbors:
  $\hat{f}(x_q) \leftarrow \frac{1}{k} \sum_{i=1}^{k} f(x_i)$
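To make the two rules concrete, here is a minimal sketch in Python/NumPy, assuming Euclidean distance on ℜ^n and labels stored in a NumPy array (the function name and signature are illustrative, not part of the original slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=5, discrete=True):
    """k-NN estimate of f(x_q): majority vote for discrete f, mean for continuous f."""
    dists = np.linalg.norm(X_train - x_q, axis=1)         # Euclidean distance to every stored example
    nn = np.argsort(dists)[:k]                            # indices of the k nearest neighbors
    if discrete:
        return Counter(y_train[nn]).most_common(1)[0][0]  # argmax_v sum_i delta(v, f(x_i))
    return y_train[nn].mean()                             # (1/k) sum_i f(x_i)
```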
4. Illustrating k-NN: a Voronoi Diagram
[Figure: a query point x_q among positive (+) and negative (−) training examples, together with the decision surface induced by 1-NN for a set of training examples.]
Note that 1-NN classifies x_q as +, while 5-NN classifies x_q as −.
5. When To Consider k-Nearest Neighbor
• Instances map to points in ℜ^n
• Fewer than 20 attributes per instance
• Lots of training data
Advantages: training is very fast; can learn complex target functions; no information is lost; robust to noisy training data.
Disadvantages: slow at query time; easily fooled by irrelevant attributes.
k-NN Inductive Bias: the classification of x_q will be most similar to the classification of other instances that are nearby.
6. k-NN: Behavior in the Limit
Let p(x) be the probability that instance x will be labeled 1 (positive) rather than 0 (negative).
k-Nearest Neighbor:
• as the number of training examples → ∞ and k grows large, k-NN approaches the Bayes optimal learner.
  Bayes optimal: if p(x) > 0.5 then predict 1, else 0.
Nearest Neighbor (k = 1):
• as the number of training examples → ∞, 1-NN approaches the Gibbs algorithm.
  Gibbs algorithm: with probability p(x) predict 1, else 0.
7. k-NN: The Curse of Dimensionality
k-NN is easily misled when X is high-dimensional: irrelevant attributes may dominate the decision!
Example: imagine instances described by n = 20 attributes, of which only 2 are relevant to the target function. Instances that have identical values for those 2 attributes may still be far from x_q in the 20-dimensional space.
Solution:
• stretch the j-th axis by a weight z_j, where z_1, ..., z_n are chosen so as to minimize the prediction error
• use an approach similar to cross-validation to automatically choose the values of the weights z_1, ..., z_n (as sketched below)
• note that setting z_j to zero eliminates the j-th dimension altogether
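A rough sketch of how the axis weights z_j might be chosen, using leave-one-out 1-NN error as the criterion and a small illustrative grid of candidate weights (these particular choices are assumptions, not prescribed by the slides):

```python
import numpy as np

def loo_error(X, y, z):
    """Leave-one-out 1-NN error with axis j stretched by weight z[j]."""
    Xz = X * z                                    # stretch each axis; z[j] = 0 drops attribute j
    err = 0
    for i in range(len(X)):
        d = np.linalg.norm(Xz - Xz[i], axis=1)
        d[i] = np.inf                             # exclude the held-out point itself
        err += y[np.argmin(d)] != y[i]
    return err / len(X)

def stretch_axes(X, y, candidates=(0.0, 0.5, 1.0, 2.0)):
    """Greedy coordinate-wise choice of z_1, ..., z_n minimizing leave-one-out error."""
    z = np.ones(X.shape[1])
    for j in range(X.shape[1]):
        errs = [loo_error(X, y, np.r_[z[:j], c, z[j+1:]]) for c in candidates]
        z[j] = candidates[int(np.argmin(errs))]
    return z
```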
8. Efficient Memory Indexing for Retrieving the Nearest Neighbors: kd-trees ([Bentley, 1975], [Friedman, 1977])
Each leaf node stores a training instance. Nearby instances are stored at the same (or nearby) nodes. The internal nodes of the tree sort the new query x_q to the relevant leaf by testing selected attributes of x_q.
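For example, SciPy provides a kd-tree implementation that can play this role; a minimal usage sketch with randomly generated data, purely for illustration:

```python
import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.rand(10000, 3)      # stored training instances in R^3
tree = cKDTree(X_train)                 # internal nodes split on selected attributes

x_q = np.array([0.5, 0.2, 0.9])
dists, idx = tree.query(x_q, k=5)       # retrieve the 5 nearest stored instances
print(idx, dists)
```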
9. 1'. A k-NN Variant: Distance-Weighted k-NN
We might want to weight nearer neighbors more heavily:
• for discrete-valued f: $\hat{f}(x_q) \leftarrow \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))$
  where $w_i \equiv \frac{1}{d(x_q, x_i)^2}$ and $d(x_q, x_i)$ is the distance between x_q and x_i;
  but if $x_q = x_i$ we take $\hat{f}(x_q) \leftarrow f(x_i)$
• for continuous-valued f: $\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$
Remark: now it makes sense to use all training examples instead of just k; in this case k-NN is known as Shepard's method.
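A minimal sketch of the distance-weighted rule for the continuous case, assuming Euclidean distance (passing k=None uses all examples, i.e. Shepard's method):

```python
import numpy as np

def dw_knn_regress(X_train, y_train, x_q, k=None):
    """Distance-weighted k-NN regression; k=None uses all training examples."""
    d = np.linalg.norm(X_train - x_q, axis=1)
    if np.any(d == 0):                        # exact match: return f(x_i) directly
        return y_train[d == 0].mean()
    idx = np.arange(len(d)) if k is None else np.argsort(d)[:k]
    w = 1.0 / d[idx] ** 2                     # w_i = 1 / d(x_q, x_i)^2
    return np.dot(w, y_train[idx]) / w.sum()  # sum_i w_i f(x_i) / sum_i w_i
```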
10. 2. Locally Weighted Regression
Note that k-NN forms a local approximation to f for each query point x_q. Why not form an explicit approximation $\hat{f}(x)$ for the region surrounding x_q?
• Fit a linear function (or a quadratic function, a multilayer neural net, etc.) to the k nearest neighbors:
  $\hat{f}(x) = w_0 + w_1 a_1(x) + \ldots + w_n a_n(x)$
  where $a_1(x), \ldots, a_n(x)$ are the attributes of the instance x.
• This produces a "piecewise approximation" to f.
11. Minimizing the Error in Locally Weighted Regression
• Squared error over the k nearest neighbors:
  $E_1(x_q) \equiv \frac{1}{2} \sum_{x \in k \text{ nearest nbrs of } x_q} (f(x) - \hat{f}(x))^2$
• Distance-weighted squared error over the entire training set D:
  $E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} K(d(x_q, x)) \, (f(x) - \hat{f}(x))^2$
  where the "kernel" function K decreases with d(x_q, x)
• A combination of the above two:
  $E_3(x_q) \equiv \frac{1}{2} \sum_{x \in k \text{ nearest nbrs of } x_q} K(d(x_q, x)) \, (f(x) - \hat{f}(x))^2$
In this last case, applying gradient descent yields the training rule
  $\Delta w_j = \eta \sum_{x \in k \text{ nearest nbrs of } x_q} K(d(x_q, x)) \, (f(x) - \hat{f}(x)) \, a_j(x)$
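Instead of the iterative gradient-descent rule, the E_3-style local fit can also be obtained in closed form by weighted least squares; a sketch assuming a Gaussian kernel whose width sigma is an illustrative parameter:

```python
import numpy as np

def lwr_predict(X_train, y_train, x_q, k=20, sigma=1.0):
    """Fit the local weights w by kernel-weighted least squares and return f_hat(x_q)."""
    d = np.linalg.norm(X_train - x_q, axis=1)
    nn = np.argsort(d)[:k]                           # k nearest neighbors of x_q
    K = np.exp(-d[nn] ** 2 / (2 * sigma ** 2))       # kernel K(d(x_q, x)), decreasing in distance
    A = np.hstack([np.ones((k, 1)), X_train[nn]])    # rows [1, a_1(x), ..., a_n(x)]
    sqrtK = np.sqrt(K)
    # minimizing sum_x K(d) (f(x) - f_hat(x))^2 is least squares on sqrt(K)-scaled rows
    w, *_ = np.linalg.lstsq(sqrtK[:, None] * A, sqrtK * y_train[nn], rcond=None)
    return np.dot(np.r_[1.0, x_q], w)                # f_hat(x_q) = w_0 + sum_j w_j a_j(x_q)
```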
12. 3. Radial Basis Function Networks
• Compute a global approximation to the target function f as a linear combination of local approximations ("kernel" functions).
• Closely related to distance-weighted regression, but "eager" instead of "lazy" (see the last slide).
• Can be thought of as a different kind of (two-layer) neural network: the hidden units compute the values of kernel functions, and the output unit computes f as a linear combination of those kernel functions.
• Used, e.g., for image classification, where the assumption of spatially local influences is well justified.
13. Radial Basis Function Networks
[Figure: a two-layer network whose inputs are the attributes a_1(x), ..., a_n(x) describing the instances, whose hidden units compute the kernels K_u, and whose output unit computes f(x) using weights w_0, w_1, ..., w_k.]
Target function:
  $f(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x))$
The kernel functions are commonly chosen as Gaussians:
  $K_u(d(x_u, x)) \equiv e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}$
The activation of a hidden unit will be close to 0 unless x is close to x_u.
The two layers are trained separately, and therefore more efficiently than in ordinary neural networks.
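A minimal sketch of the forward pass of such a network, assuming Gaussian kernels with a shared width sigma (the names and the shared-width simplification are illustrative):

```python
import numpy as np

def rbf_forward(x, centers, w0, w, sigma=1.0):
    """f(x) = w_0 + sum_u w_u * exp(-d^2(x_u, x) / (2 sigma^2))."""
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distances d^2(x_u, x)
    K = np.exp(-d2 / (2 * sigma ** 2))        # hidden-unit activations, near 0 far from x_u
    return w0 + np.dot(w, K)
```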
14. [Hartman et al., 1990]
Theorem: the function f can be approximated with arbitrarily small error, provided
– a sufficiently large number k of kernels, and
– the width $\sigma_u^2$ of each kernel $K_u$ can be separately specified.
15. Training Radial Basis Function Networks
Q1: What x_u to use for each kernel function K_u(d(x_u, x))?
• Scatter the centers uniformly throughout the instance space
• Or use the training instances themselves (reflects the instance distribution)
• Or form prototypical clusters of instances and center one K_u at each cluster
Q2: How to train the weights (assuming Gaussian K_u)?
• First choose the mean (and perhaps the variance) of each K_u, e.g. using the EM algorithm
• Then hold the K_u fixed and train the linear output layer to obtain the weights w_i
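A sketch of this two-stage scheme: the centers x_u are taken as prototypical clusters found by k-means (one of the options under Q1), the width is set by a simple heuristic, and the linear output layer is then fit by least squares with the kernels held fixed. The use of scikit-learn and the width heuristic are assumptions made for the sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_rbf(X, y, k=10, sigma=None):
    """Stage 1: choose centers x_u by clustering. Stage 2: fit w_0..w_k with the K_u held fixed."""
    centers = KMeans(n_clusters=k, n_init=10).fit(X).cluster_centers_
    if sigma is None:                                        # heuristic width from the center spread
        sigma = np.mean(np.linalg.norm(centers - centers.mean(axis=0), axis=1))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.hstack([np.ones((len(X), 1)), np.exp(-d2 / (2 * sigma ** 2))])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # linear output layer
    return centers, sigma, w

def rbf_predict(x, centers, sigma, w):
    d2 = ((centers - x) ** 2).sum(-1)
    return w[0] + np.dot(w[1:], np.exp(-d2 / (2 * sigma ** 2)))
```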
16. 4. Case-Based Reasoning
Case-based reasoning is instance-based learning applied to instance spaces X ≠ ℜ^n, usually with symbolic logic descriptions. In this case we need a different "distance" metric.
It has been applied to:
• conceptual design of mechanical devices, based on a stored library of previous designs
• reasoning about new legal cases, based on previous rulings
• scheduling problems, by reusing/combining portions of solutions to similar problems
17. The CADET Case-Based Reasoning System
CADET uses a library of 75 stored examples of mechanical devices:
• each training example: ⟨qualitative function, mechanical structure⟩, using rich structural descriptions
• new query: the desired function; target value: a mechanical structure realizing that function
Distance metric: match on the qualitative function descriptions.
Problem solving: multiple cases are retrieved, combined, and eventually extended to form a solution to the new problem.
18. Case-Based Reasoning in CADET
[Figure: a stored case, a T-junction pipe, shown both as a structure diagram and as a qualitative function graph relating its water flows Q_1, Q_2, Q_3 and temperatures T_1, T_2, T_3; and a problem specification, a water faucet, given only by its desired qualitative function relating control signals and input/output water flows and temperatures, with its structure left to be determined.]
19. Lazy Learning vs. Eager Learning Algorithms
Lazy: wait for the query before generalizing.
◦ k-Nearest Neighbor, locally weighted regression, case-based reasoning
• Can create many local approximations.
Eager: generalize before seeing the query.
◦ Radial basis function networks, ID3, Backpropagation, Naive Bayes, ...
• Must create a single global approximation.
Does it matter? If they use the same hypothesis space H, lazy learners can represent more complex functions. E.g., a lazy Backpropagation algorithm can learn a network that is different for each query point, unlike the eager version of Backpropagation.