CSCE 478/878 Lecture 8: Instance-Based Learning
Stephen D. Scott
(Adapted from Tom Mitchell's slides)
November 14, 2006
Outline

• k-Nearest Neighbor
• Locally weighted regression
• Radial basis functions
• Case-based reasoning
• Lazy and eager learning
Nearest Neighbor

Key idea: just store all training examples ⟨x_i, f(x_i)⟩

Need some distance measure between instances (e.g. Euclidean distance, Hamming distance)

Nearest neighbor:
• Given query instance x_q, first locate the nearest training example x_n, then estimate
  \hat{f}(x_q) = f(x_n)

k-Nearest neighbor:
• Given x_q, take a vote among its k nearest neighbors (if discrete-valued target function)
  – Choose k not divisible by the number of possible labels (to avoid ties)
• Take the mean of the f values of the k nearest neighbors if f is real-valued:
  \hat{f}(x_q) = \frac{\sum_{i=1}^{k} f(x_i)}{k}
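A minimal sketch of 1-NN / k-NN prediction in Python, assuming training data is stored as (feature vector, target) pairs and using Euclidean distance; the function and variable names here are illustrative, not from the lecture:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two equal-length feature vectors
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(query, examples, k=3, discrete=True):
    """Predict f(query) from stored examples [(x_i, f(x_i)), ...]."""
    # Keep the k training examples nearest to the query
    neighbors = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    labels = [fx for _, fx in neighbors]
    if discrete:
        # Majority vote among the k nearest neighbors
        return Counter(labels).most_common(1)[0][0]
    # Real-valued target: mean of the k nearest neighbors' f values
    return sum(labels) / k

# Toy usage (illustrative data only)
train = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((1.0, 1.0), 1), ((0.9, 1.1), 1)]
print(knn_predict((0.2, 0.1), train, k=3))   # -> 0
```

With k=1 and discrete=True this reduces to plain nearest neighbor.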
Voronoi Diagram

[Figure: Voronoi diagram showing the decision surface for 1-NN, with positive (+) and negative (−) training examples and a query point x_q]
When To Consider Nearest Neighbor

• Instances map to points in ℜ^n (or, at least, one can define some distance measure between instances)
• Fewer than 20 attributes per instance
  – To avoid the curse of dimensionality, where many irrelevant attributes cause distances to look large even though the distance over the relevant attributes alone is small
  – Also, a large number of attributes increases classification complexity
• Lots of training data

Advantages:
• Robust to noise
• Stable
• Training is very fast
• Can learn complex target functions
• Doesn't lose information

Disadvantages:
• Slow at query time (active research area: fast indexing and accessing algorithms)
• Easily fooled by irrelevant attributes
Nearest Neighbor's Behavior in the Limit

Let p(x) be the probability that instance x is labeled 1 (positive) versus 0 (negative).

Nearest neighbor (k = 1):
• As the number of training examples → ∞, approaches the Gibbs algorithm
• Recall Gibbs has at most twice the expected error of Bayes optimal

k-Nearest neighbor:
• As the number of training examples → ∞ and k grows large, approaches Bayes optimal (the best possible with the given hypothesis space and prior information)

Bayes optimal: if p(x) > 0.5 then predict 1, else 0
Distance-Weighted k-NN

Might want to weight nearer neighbors more heavily.

For discrete-valued f:
  \hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))
where \delta(v, f(x_i)) = 1 if v = f(x_i) and 0 otherwise.

For continuous f:
  \hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}

where
  w_i \equiv \frac{1}{d(x_q, x_i)^2}
and d(x_q, x_i) is the distance between x_q and x_i.

Note that it now makes sense to use all training examples instead of just k (Shepard's method), but this increases the time to classify instances.
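A minimal sketch of distance-weighted k-NN in Python, using the same (vector, target) training format as the earlier sketch; the small epsilon guarding against a zero distance (exact match with a training example) is an added assumption, not from the slides:

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def dw_knn_predict(query, examples, k=None, discrete=True, eps=1e-12):
    """Distance-weighted k-NN; k=None uses all examples (Shepard's method)."""
    ranked = sorted(examples, key=lambda ex: euclidean(query, ex[0]))
    if k is not None:
        ranked = ranked[:k]
    if discrete:
        # Each neighbor votes for its label with weight w_i = 1 / d^2
        votes = defaultdict(float)
        for x, fx in ranked:
            votes[fx] += 1.0 / (euclidean(query, x) ** 2 + eps)
        return max(votes, key=votes.get)
    # Continuous target: weighted mean of the neighbors' f values
    num = den = 0.0
    for x, fx in ranked:
        w = 1.0 / (euclidean(query, x) ** 2 + eps)
        num += w * fx
        den += w
    return num / den
```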
Curse of Dimensionality

Imagine instances described by 20 attributes, but only 2 are relevant to the target function.

Curse of dimensionality: nearest neighbor is easily misled by high-dimensional X.

One approach (sketched below):
• Stretch the j-th axis by weight z_j, where z_1, ..., z_n are chosen to minimize prediction error
• Use cross-validation to automatically choose the weights z_1, ..., z_n
• Note that setting z_j to zero eliminates that dimension altogether

See [Moore and Lee, 1994].
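One way the axis-stretching idea could be realized, as a rough sketch (this is not the Moore and Lee method): apply a per-attribute weight z_j inside the distance and score each candidate weight vector by leave-one-out 1-NN error. The grid of candidate weights and helper names are illustrative assumptions, and the exhaustive grid is only sensible for a small number of attributes:

```python
import itertools
import math

def weighted_dist(a, b, z):
    # Distance with axis j stretched by weight z[j]; z[j] = 0 drops attribute j
    return math.sqrt(sum((zj * (ai - bi)) ** 2 for zj, ai, bi in zip(z, a, b)))

def loo_error(examples, z):
    """Leave-one-out 1-NN error rate under axis weights z."""
    mistakes = 0
    for i, (xq, fq) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        _, fn = min(rest, key=lambda ex: weighted_dist(xq, ex[0], z))
        mistakes += (fn != fq)
    return mistakes / len(examples)

def choose_weights(examples, candidates=(0.0, 0.5, 1.0)):
    """Grid-search z_1..z_n by cross-validated error (exponential in n)."""
    n = len(examples[0][0])
    return min(itertools.product(candidates, repeat=n),
               key=lambda z: loo_error(examples, z))
```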
Locally Weighted Regression

Note that k-NN forms a local approximation to f for each query point x_q.

Why not form an explicit approximation \hat{f}(x) for the region surrounding x_q?
• Fit a linear, quadratic, etc. function to the k nearest neighbors
• Produces a "piecewise approximation" to f
• Do this for each new query point x_q

Several choices of error to minimize:

• Squared error over the k nearest neighbors:
  E_1(x_q) \equiv \frac{1}{2} \sum_{x \in k\ \text{nearest nbrs of}\ x_q} \left( f(x) - \hat{f}(x) \right)^2

• Distance-weighted squared error over all neighbors:
  E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} K(d(x_q, x)) \left( f(x) - \hat{f}(x) \right)^2
  (K is decreasing in its argument)

• Combine E_1 and E_2
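A sketch of locally weighted linear regression, assuming a Gaussian kernel for K and a linear model fit only to the k nearest neighbors (roughly E_2 restricted to the neighborhood); the bandwidth parameter tau and the closed-form weighted least-squares solve are assumptions for illustration, and the fit requires at least n+1 neighbors for the normal equations to be solvable:

```python
import numpy as np

def locally_weighted_predict(xq, X, y, k=10, tau=1.0):
    """Fit a distance-weighted linear model around query xq and predict f(xq)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xq = np.asarray(xq, dtype=float)
    d = np.linalg.norm(X - xq, axis=1)           # distances to the query
    idx = np.argsort(d)[:k]                      # k nearest neighbors
    Xk, yk, dk = X[idx], y[idx], d[idx]
    w = np.exp(-dk ** 2 / (2 * tau ** 2))        # kernel weights K(d(x_q, x))
    A = np.hstack([np.ones((k, 1)), Xk])         # add intercept column
    W = np.diag(w)
    # Weighted least squares: solve A^T W A beta = A^T W y
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ yk)
    return np.array([1.0, *xq]) @ beta
```

A new local fit is computed for each query point, which is what makes the method lazy.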
Radial Basis Function (RBF) Networks

• Global approximation to the target function, in terms of a linear combination of local approximations
• Used, e.g., for image classification
• A different kind of neural network
• Closely related to distance-weighted regression, but "eager" instead of "lazy"
RBF Networks (cont'd)

[Figure: RBF network with inputs a_1(x), ..., a_n(x), a hidden layer of kernel units, and output weights w_0, w_1, ..., w_k feeding f(x)]

where a_i(x) are the attributes describing instance x, and
  \hat{f}(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x))

(Note: no weights from the input to the hidden layer)

One common choice for K_u(d(x_u, x)) is
  K_u(d(x_u, x)) = \exp\left( -\frac{1}{2\sigma_u^2} d^2(x_u, x) \right),
i.e. a Gaussian with mean at x_u and variance \sigma_u^2, with all features independent [note bug on p. 239]
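A small sketch of evaluating the RBF output formula above with Gaussian kernels, assuming the centers, widths, and output weights are already given (training is on the next slide); parameter names are illustrative:

```python
import numpy as np

def rbf_output(x, centers, sigmas, weights, w0):
    """Evaluate f_hat(x) = w0 + sum_u w_u * exp(-d^2(x_u, x) / (2 sigma_u^2)).

    centers: kernel centers x_u, shape (k, n)
    sigmas:  per-kernel widths sigma_u, shape (k,)
    weights: output-layer weights w_u, shape (k,)
    w0:      bias weight
    """
    x = np.asarray(x, dtype=float)
    d2 = np.sum((np.asarray(centers, dtype=float) - x) ** 2, axis=1)  # squared distances
    k_vals = np.exp(-d2 / (2 * np.asarray(sigmas, dtype=float) ** 2)) # Gaussian kernel values
    return w0 + np.dot(weights, k_vals)
```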
Training Radial Basis Function Networks

1. Choose the number of kernel functions (hidden units)
   • If equal to the number of training examples, can fit the training data exactly by placing one center per example
   • Using fewer ⇒ more efficient, less chance of overfitting

2. Choose the center (= mean for a Gaussian) x_u of each kernel function K_u(d(x_u, x))
   • Use all training instances if enough kernels are available
   • Use a subset of the training instances
   • Scatter centers uniformly throughout the instance space
   • Can cluster the data and assign one center per cluster (helps answer step 1 as well)
   • Can use EM to find the means of a mixture of Gaussians
   • Can also use, e.g., EM to find the σ_u's (for Gaussians)

3. Hold the kernels fixed and train the weights of the linear output layer, e.g. with GD or EG (a sketch follows)
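A sketch of steps 1–3 under simplifying assumptions: k centers are drawn as a random subset of the training instances, a single shared sigma is fixed by hand, and the output-layer weights are fit by linear least squares as a closed-form stand-in for the GD/EG training mentioned above. All names and defaults are illustrative:

```python
import numpy as np

def train_rbf(X, y, k=5, sigma=1.0, seed=0):
    """Return (centers, sigma, weights, w0) for a Gaussian RBF network."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: choose centers as a subset of the training instances
    centers = X[rng.choice(len(X), size=k, replace=False)]
    # Kernel activations for every training instance, plus a bias column
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.hstack([np.ones((len(X), 1)), np.exp(-d2 / (2 * sigma ** 2))])
    # Step 3: hold kernels fixed, solve for w0, w_1..w_k minimizing squared error
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, sigma, w[1:], w[0]
```

The returned parameters can be plugged into the rbf_output sketch on the previous slide.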
Case-Based Reasoning and CADET

Can apply instance-based learning even when X is much more complex; need a different "distance" metric.

Case-based reasoning is instance-based learning where instances have symbolic logic descriptions:

((user-complaint error53-on-shutdown)
 (cpu-model PowerPC)
 (operating-system Windows)
 (memory 48meg)
 (installed-apps Excel Netscape VirusScan)
 (disk 1gig)
 (likely-cause ???))

CADET: 75 stored examples of mechanical devices, e.g. water faucets
• Training example: ⟨qualitative function, mechanical structure⟩
• New query: desired function
• Target value: mechanical structure for this function

Distance metric: match qualitative function descriptions
Case-Based Reasoning in CADET Example

A stored case: T-junction pipe
[Figure: structure of a T-junction pipe and its qualitative function graph over waterflows Q_1, Q_2, Q_3 and temperatures T_1, T_2, T_3 (Q = waterflow, T = temperature)]

A problem specification: water faucet
[Figure: desired qualitative function graph over controls C_t, C_f, flows Q_c, Q_h, Q_m, and temperatures T_c, T_h, T_m; the mechanical structure is unknown (?)]

E.g. distance measure = size of the largest isomorphic subgraph
Case-Based Reasoning in CADET (cont'd)

• Instances represented by rich structural (symbolic) descriptions, vs. e.g. points in ℜ^n for k-NN
• Multiple cases retrieved (and combined) to form a solution to the new problem: similar to k-NN, except the combination procedure can rely on knowledge-based reasoning (e.g. can two components be fit together?)
• Tight coupling between case retrieval, knowledge-based reasoning, and problem solving, e.g. application of rewrite rules in function graphs and backtracking in the search space

Bottom line:
• Simple matching of cases is useful for tasks such as answering help-desk queries
• Area of ongoing research, including improving indexing and search methods
Lazy and Eager Learning

Lazy: wait for the query before generalizing
• k-NN, locally weighted regression, case-based reasoning

Eager: generalize before seeing the query
• Radial basis function networks, ID3, Backpropagation, Naive Bayes

Does it matter?
• Computation time for training and generalization
• An eager learner must create a single global approximation; a lazy learner can create many local approximations
• If they use the same H, lazy can represent more complex functions (e.g. consider H = linear functions), since it considers the query instance x_q before generalizing, i.e. lazy produces a new hypothesis for each new x_q