0. Instance-Based Learning
Based on "Machine Learning", T. Mitchell, McGraw-Hill, 1997, ch. 8.
Acknowledgement: the present slides are an adaptation of slides drawn by T. Mitchell.
1. Key Ideas
Training: simply store all the training examples.
Classification: compute the target function only locally, for each query.
Advantage: it can prove useful for very complex target functions.
Disadvantages: 1. classification can be computationally costly; 2. usually all attributes are considered, even irrelevant ones.
2. Methods
1. k-Nearest Neighbor
2. Locally weighted regression: a generalization of k-NN
3. Radial basis functions: combining instance-based learning and neural networks
4. Case-based reasoning: symbolic representations and knowledge-based inference
3. 1. k-Nearest Neighbor Learning
Given a query instance x_q, estimate $\hat{f}(x_q)$:
• in case of discrete-valued $f : \Re^n \rightarrow V$, take a vote among its k nearest neighbors:
  $\hat{f}(x_q) \leftarrow \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$
  where $\delta(a, b) = 1$ if $a = b$, and $\delta(a, b) = 0$ if $a \neq b$
• in case of continuous-valued f, take the mean of the f values of its k nearest neighbors:
  $\hat{f}(x_q) \leftarrow \frac{1}{k} \sum_{i=1}^{k} f(x_i)$
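To make the two rules concrete, here is a minimal sketch in Python/NumPy, assuming Euclidean distance on ℜ^n and labels stored in a NumPy array (the function name and signature are illustrative, not part of the original slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=5, discrete=True):
    """k-NN estimate of f(x_q): majority vote for discrete f, mean for continuous f."""
    dists = np.linalg.norm(X_train - x_q, axis=1)         # Euclidean distance to every stored example
    nn = np.argsort(dists)[:k]                            # indices of the k nearest neighbors
    if discrete:
        return Counter(y_train[nn]).most_common(1)[0][0]  # argmax_v sum_i delta(v, f(x_i))
    return y_train[nn].mean()                             # (1/k) sum_i f(x_i)
```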
4. Illustrating k-NN: a Voronoi Diagram
[Figure: a query point x_q among positive (+) and negative (−) training examples, together with the decision surface induced by 1-NN for a set of training examples.]
Note that 1-NN classifies x_q as +, while 5-NN classifies x_q as −.
5. When To Consider k-Nearest Neighbor
• Instances map to points in ℜ^n
• Fewer than 20 attributes per instance
• Lots of training data
Advantages: training is very fast; can learn complex target functions; no information is lost; robust to noisy training data.
Disadvantages: slow at query time; easily fooled by irrelevant attributes.
k-NN Inductive Bias: the classification of x_q will be most similar to the classification of other instances that are nearby.
6. k-NN: Behavior in the Limit
Let p(x) be the probability that instance x will be labeled 1 (positive) rather than 0 (negative).
k-Nearest Neighbor:
• as the number of training examples → ∞ and k grows large, k-NN approaches the Bayes optimal learner.
  Bayes optimal: if p(x) > 0.5 then predict 1, else 0.
Nearest Neighbor (k = 1):
• as the number of training examples → ∞, 1-NN approaches the Gibbs algorithm.
  Gibbs algorithm: with probability p(x) predict 1, else 0.
7. k-NN: The Curse of Dimensionality
k-NN is easily misled when X is high-dimensional: irrelevant attributes may dominate the decision!
Example: imagine instances described by n = 20 attributes, of which only 2 are relevant to the target function. Instances that have identical values for those 2 attributes may still be far from x_q in the 20-dimensional space.
Solution:
• stretch the j-th axis by a weight z_j, where z_1, ..., z_n are chosen so as to minimize the prediction error
• use an approach similar to cross-validation to automatically choose the values of the weights z_1, ..., z_n (as sketched below)
• note that setting z_j to zero eliminates the j-th dimension altogether
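A rough sketch of how the axis weights z_j might be chosen, using leave-one-out 1-NN error as the criterion and a small illustrative grid of candidate weights (these particular choices are assumptions, not prescribed by the slides):

```python
import numpy as np

def loo_error(X, y, z):
    """Leave-one-out 1-NN error with axis j stretched by weight z[j]."""
    Xz = X * z                                    # stretch each axis; z[j] = 0 drops attribute j
    err = 0
    for i in range(len(X)):
        d = np.linalg.norm(Xz - Xz[i], axis=1)
        d[i] = np.inf                             # exclude the held-out point itself
        err += y[np.argmin(d)] != y[i]
    return err / len(X)

def stretch_axes(X, y, candidates=(0.0, 0.5, 1.0, 2.0)):
    """Greedy coordinate-wise choice of z_1, ..., z_n minimizing leave-one-out error."""
    z = np.ones(X.shape[1])
    for j in range(X.shape[1]):
        errs = [loo_error(X, y, np.r_[z[:j], c, z[j+1:]]) for c in candidates]
        z[j] = candidates[int(np.argmin(errs))]
    return z
```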
8. Efficient Memory Indexing for Retrieving the Nearest Neighbors: kd-trees ([Bentley, 1975], [Friedman, 1977])
Each leaf node stores a training instance. Nearby instances are stored at the same (or nearby) nodes. The internal nodes of the tree sort the new query x_q to the relevant leaf by testing selected attributes of x_q.
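For example, SciPy provides a kd-tree implementation that can play this role; a minimal usage sketch with randomly generated data, purely for illustration:

```python
import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.rand(10000, 3)      # stored training instances in R^3
tree = cKDTree(X_train)                 # internal nodes split on selected attributes

x_q = np.array([0.5, 0.2, 0.9])
dists, idx = tree.query(x_q, k=5)       # retrieve the 5 nearest stored instances
print(idx, dists)
```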
9. 1'. A k-NN Variant: Distance-Weighted k-NN
We might want to weight nearer neighbors more heavily:
• for discrete-valued f: $\hat{f}(x_q) \leftarrow \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))$
  where $w_i \equiv \frac{1}{d(x_q, x_i)^2}$ and $d(x_q, x_i)$ is the distance between x_q and x_i;
  but if $x_q = x_i$ we take $\hat{f}(x_q) \leftarrow f(x_i)$
• for continuous-valued f: $\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$
Remark: now it makes sense to use all training examples instead of just k; in this case k-NN is known as Shepard's method.
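A minimal sketch of the distance-weighted rule for the continuous case, assuming Euclidean distance (passing k=None uses all examples, i.e. Shepard's method):

```python
import numpy as np

def dw_knn_regress(X_train, y_train, x_q, k=None):
    """Distance-weighted k-NN regression; k=None uses all training examples."""
    d = np.linalg.norm(X_train - x_q, axis=1)
    if np.any(d == 0):                        # exact match: return f(x_i) directly
        return y_train[d == 0].mean()
    idx = np.arange(len(d)) if k is None else np.argsort(d)[:k]
    w = 1.0 / d[idx] ** 2                     # w_i = 1 / d(x_q, x_i)^2
    return np.dot(w, y_train[idx]) / w.sum()  # sum_i w_i f(x_i) / sum_i w_i
```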
10. 2. Locally Weighted Regression
Note that k-NN forms a local approximation to f for each query point x_q. Why not form an explicit approximation $\hat{f}(x)$ for the region surrounding x_q?
• Fit a linear function (or a quadratic function, a multilayer neural net, etc.) to the k nearest neighbors:
  $\hat{f}(x) = w_0 + w_1 a_1(x) + \ldots + w_n a_n(x)$
  where $a_1(x), \ldots, a_n(x)$ are the attributes of the instance x.
• This produces a "piecewise approximation" to f.
11. Minimizing the Error in Locally Weighted Regression
• Squared error over the k nearest neighbors:
  $E_1(x_q) \equiv \frac{1}{2} \sum_{x \in k \text{ nearest nbrs of } x_q} (f(x) - \hat{f}(x))^2$
• Distance-weighted squared error over the entire training set D:
  $E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} K(d(x_q, x)) \, (f(x) - \hat{f}(x))^2$
  where the "kernel" function K decreases with d(x_q, x)
• A combination of the above two:
  $E_3(x_q) \equiv \frac{1}{2} \sum_{x \in k \text{ nearest nbrs of } x_q} K(d(x_q, x)) \, (f(x) - \hat{f}(x))^2$
In this last case, applying gradient descent yields the training rule
  $\Delta w_j = \eta \sum_{x \in k \text{ nearest nbrs of } x_q} K(d(x_q, x)) \, (f(x) - \hat{f}(x)) \, a_j(x)$
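Instead of the iterative gradient-descent rule, the E_3-style local fit can also be obtained in closed form by weighted least squares; a sketch assuming a Gaussian kernel whose width sigma is an illustrative parameter:

```python
import numpy as np

def lwr_predict(X_train, y_train, x_q, k=20, sigma=1.0):
    """Fit the local weights w by kernel-weighted least squares and return f_hat(x_q)."""
    d = np.linalg.norm(X_train - x_q, axis=1)
    nn = np.argsort(d)[:k]                           # k nearest neighbors of x_q
    K = np.exp(-d[nn] ** 2 / (2 * sigma ** 2))       # kernel K(d(x_q, x)), decreasing in distance
    A = np.hstack([np.ones((k, 1)), X_train[nn]])    # rows [1, a_1(x), ..., a_n(x)]
    sqrtK = np.sqrt(K)
    # minimizing sum_x K(d) (f(x) - f_hat(x))^2 is least squares on sqrt(K)-scaled rows
    w, *_ = np.linalg.lstsq(sqrtK[:, None] * A, sqrtK * y_train[nn], rcond=None)
    return np.dot(np.r_[1.0, x_q], w)                # f_hat(x_q) = w_0 + sum_j w_j a_j(x_q)
```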
12. 3. Radial Basis Function Networks
• Compute a global approximation to the target function f as a linear combination of local approximations ("kernel" functions).
• Closely related to distance-weighted regression, but "eager" instead of "lazy" (see the last slide).
• Can be thought of as a different kind of (two-layer) neural network: the hidden units compute the values of kernel functions, and the output unit computes f as a linear combination of those kernel functions.
• Used, e.g., for image classification, where the assumption of spatially local influences is well justified.
13. Radial Basis Function Networks
[Figure: a two-layer network whose inputs are the attributes a_1(x), ..., a_n(x) describing the instances, whose hidden units compute the kernels K_u, and whose output unit computes f(x) using weights w_0, w_1, ..., w_k.]
Target function:
  $f(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x))$
The kernel functions are commonly chosen as Gaussians:
  $K_u(d(x_u, x)) \equiv e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}$
The activation of a hidden unit will be close to 0 unless x is close to x_u.
The two layers are trained separately, and therefore more efficiently than in ordinary neural networks.
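A minimal sketch of the forward pass of such a network, assuming Gaussian kernels with a shared width sigma (the names and the shared-width simplification are illustrative):

```python
import numpy as np

def rbf_forward(x, centers, w0, w, sigma=1.0):
    """f(x) = w_0 + sum_u w_u * exp(-d^2(x_u, x) / (2 sigma^2))."""
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distances d^2(x_u, x)
    K = np.exp(-d2 / (2 * sigma ** 2))        # hidden-unit activations, near 0 far from x_u
    return w0 + np.dot(w, K)
```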
14. [Hartman et al., 1990]
Theorem: the function f can be approximated with arbitrarily small error, provided
– a sufficiently large number k of kernels, and
– the width $\sigma_u^2$ of each kernel $K_u$ can be separately specified.
15. Training Radial Basis Function Networks
Q1: What x_u to use for each kernel function K_u(d(x_u, x))?
• Scatter the centers uniformly throughout the instance space
• Or use the training instances themselves (reflects the instance distribution)
• Or form prototypical clusters of instances and center one K_u at each cluster
Q2: How to train the weights (assuming Gaussian K_u)?
• First choose the mean (and perhaps the variance) of each K_u, e.g. using the EM algorithm
• Then hold the K_u fixed and train the linear output layer to obtain the weights w_i
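A sketch of this two-stage scheme: the centers x_u are taken as prototypical clusters found by k-means (one of the options under Q1), the width is set by a simple heuristic, and the linear output layer is then fit by least squares with the kernels held fixed. The use of scikit-learn and the width heuristic are assumptions made for the sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_rbf(X, y, k=10, sigma=None):
    """Stage 1: choose centers x_u by clustering. Stage 2: fit w_0..w_k with the K_u held fixed."""
    centers = KMeans(n_clusters=k, n_init=10).fit(X).cluster_centers_
    if sigma is None:                                        # heuristic width from the center spread
        sigma = np.mean(np.linalg.norm(centers - centers.mean(axis=0), axis=1))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.hstack([np.ones((len(X), 1)), np.exp(-d2 / (2 * sigma ** 2))])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # linear output layer
    return centers, sigma, w

def rbf_predict(x, centers, sigma, w):
    d2 = ((centers - x) ** 2).sum(-1)
    return w[0] + np.dot(w[1:], np.exp(-d2 / (2 * sigma ** 2)))
```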
16. 4. Case-Based Reasoning
Case-based reasoning is instance-based learning applied to instance spaces X ≠ ℜ^n, usually with symbolic logic descriptions. In this case we need a different "distance" metric.
It has been applied to:
• conceptual design of mechanical devices, based on a stored library of previous designs
• reasoning about new legal cases, based on previous rulings
• scheduling problems, by reusing/combining portions of solutions to similar problems
17. The CADET Case-Based Reasoning System
CADET uses a library of 75 stored examples of mechanical devices:
• each training example: ⟨qualitative function, mechanical structure⟩, using rich structural descriptions
• new query: the desired function; target value: a mechanical structure realizing that function
Distance metric: match on the qualitative function descriptions.
Problem solving: multiple cases are retrieved, combined, and eventually extended to form a solution to the new problem.
18. Case-Based Reasoning in CADET
[Figure: a stored case, a T-junction pipe, shown both as a structure diagram and as a qualitative function graph relating its water flows Q_1, Q_2, Q_3 and temperatures T_1, T_2, T_3; and a problem specification, a water faucet, given only by its desired qualitative function relating control signals and input/output water flows and temperatures, with its structure left to be determined.]
19. Lazy Learning vs. Eager Learning Algorithms
Lazy: wait for the query before generalizing.
◦ k-Nearest Neighbor, locally weighted regression, case-based reasoning
• Can create many local approximations.
Eager: generalize before seeing the query.
◦ Radial basis function networks, ID3, Backpropagation, Naive Bayes, ...
• Must create a single global approximation.
Does it matter? If they use the same hypothesis space H, lazy learners can represent more complex functions. E.g., a lazy Backpropagation algorithm can learn a network that is different for each query point, unlike the eager version of Backpropagation.