Machine Learning: Instance Based Learning
Hamid Beigy
Sharif University of Technology
Fall 1396
Table of contents

1. Introduction
2. Nearest neighbor algorithms
3. Distance-weighted nearest neighbor algorithms
4. Locally weighted regression
5. Finding KNN(x) efficiently
Introduction

1. The methods described before, such as decision trees, Bayesian classifiers, and boosting, first find a hypothesis and then use this hypothesis to classify new test examples.
2. These methods are called eager learning.
3. Instance-based learning algorithms such as k-NN store all of the training examples and classify a new example x by finding the training example (x_i, y_i) that is nearest to x according to some distance metric.
4. Instance-based classifiers do not explicitly compute decision boundaries. However, the boundaries form a subset of the Voronoi diagram of the training data.
Nearest neighbor algorithms

1. Fix k ≥ 1 and let S = {(x_1, t_1), ..., (x_N, t_N)} be a labeled sample with t_i ∈ {0, 1}. For every test example x, k-NN returns the hypothesis h defined by
   \[
   h(x) = \mathbb{I}\left[\sum_{i:\, t_i = 1} w_i > \sum_{i:\, t_i = 0} w_i\right],
   \]
   where the weights w_1, ..., w_N are chosen such that w_i = 1/k if x_i is among the k nearest neighbors of x, and w_i = 0 otherwise (see the sketch after this list).
2. The boundaries form a subset of the Voronoi diagram of the training data.
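The following is a minimal NumPy sketch of this decision rule. The function name knn_predict, the use of Euclidean distance, and the toy data are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def knn_predict(X_train, t_train, x, k=3):
    """Classify x by majority vote among its k nearest training examples.

    Each of the k nearest neighbors receives weight 1/k; the prediction is 1
    when the total weight of class-1 neighbors exceeds that of class-0 neighbors.
    """
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    w = np.full(k, 1.0 / k)                       # uniform weights w_i = 1/k
    score_1 = w[t_train[nearest] == 1].sum()
    score_0 = w[t_train[nearest] == 0].sum()
    return int(score_1 > score_0)

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
t_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, t_train, np.array([0.95, 0.9]), k=3))  # -> 1
```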
Nearest neighbor algorithms

1. The k-NN algorithm only requires
   - an integer k,
   - a set of labeled examples S,
   - a metric to measure closeness.
2. For all points x, y, z, a metric d must satisfy the following properties (a numerical spot check is sketched below).
   - Non-negativity: d(x, y) ≥ 0.
   - Reflexivity: d(x, y) = 0 ⇔ x = y.
   - Symmetry: d(x, y) = d(y, x).
   - Triangle inequality: d(x, y) + d(y, z) ≥ d(x, z).
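As a small illustration, these axioms can be spot-checked numerically for the Euclidean distance. This is only a sketch: the helper name check_metric_axioms is made up for this example, and a numerical check cannot verify the implication d(x, y) = 0 ⇒ x = y exhaustively.

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def check_metric_axioms(d, points, tol=1e-12):
    """Spot-check non-negativity, reflexivity, symmetry, and the triangle
    inequality of a distance function d on a list of sample points."""
    for x in points:
        assert d(x, x) <= tol                               # reflexivity: d(x, x) = 0
        for y in points:
            assert d(x, y) >= -tol                          # non-negativity
            assert abs(d(x, y) - d(y, x)) <= tol            # symmetry
            for z in points:
                assert d(x, y) + d(y, z) >= d(x, z) - tol   # triangle inequality
    return True

points = [np.random.randn(3) for _ in range(5)]
print(check_metric_axioms(euclidean, points))  # -> True
```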
Distance functions

1. The Minkowski distance for D-dimensional examples is the L_p norm:
   \[
   L_p(x, y) = \left(\sum_{i=1}^{D} |x_i - y_i|^p\right)^{1/p}.
   \]
2. The Euclidean distance is the L_2 norm:
   \[
   L_2(x, y) = \left(\sum_{i=1}^{D} |x_i - y_i|^2\right)^{1/2}.
   \]
3. The Manhattan or city block distance is the L_1 norm:
   \[
   L_1(x, y) = \sum_{i=1}^{D} |x_i - y_i|.
   \]
4. The L_∞ norm is the maximum of the distances along the axes:
   \[
   L_\infty(x, y) = \max_i |x_i - y_i|.
   \]
These norms are compared on a small example below.
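A short NumPy sketch comparing these norms; the vectors are arbitrary toy values chosen for illustration.

```python
import numpy as np

def minkowski(x, y, p):
    """L_p (Minkowski) distance between two D-dimensional vectors."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, 1))        # L_1 (Manhattan / city block): 5.0
print(minkowski(x, y, 2))        # L_2 (Euclidean): sqrt(13) ≈ 3.61
print(np.max(np.abs(x - y)))     # L_inf (maximum along the axes): 3.0
```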
Nearest neighbor algorithm for regression

1. The k-NN algorithm can be adapted for approximating a continuous-valued target function.
2. We take the mean of the k nearest training examples rather than a majority vote:
   \[
   \hat{f}(x) = \frac{\sum_{i=1}^{k} f(x_i)}{k}.
   \]
   A sketch of this estimator follows below.
3. (Figure: the effect of k on the performance of the algorithm. Pictures are taken from P. Rai's slides.)
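A minimal sketch of the k-NN regression estimator (NumPy; the function name knn_regress and the toy sine data are assumptions for illustration).

```python
import numpy as np

def knn_regress(X_train, f_train, x, k=3):
    """Predict f(x) as the mean of the target values of the k nearest
    training examples (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return f_train[nearest].mean()

# Toy usage: approximate a sine function
X_train = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
f_train = np.sin(2 * np.pi * X_train).ravel()
print(knn_regress(X_train, f_train, np.array([0.25]), k=3))  # ≈ 0.96, close to sin(pi/2) = 1
```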
Nearest neighbor algorithms

1. The k-NN algorithm is a lazy learning algorithm.
   - It defers finding a hypothesis until a test example x arrives.
   - For the test example x, it uses the stored training data directly.
   - It discards the found hypothesis and any intermediate results.
2. This strategy is the opposite of an eager learning algorithm, which
   - finds a hypothesis h using the training set, and
   - uses the found hypothesis h to classify the test example x.
3. Trade-offs
   - During the training phase, lazy algorithms have lower computational costs than eager algorithms.
   - During the testing phase, lazy algorithms have greater storage requirements and higher computational costs.
4. What is the inductive bias of k-NN?
Properties of nearest neighbor algorithms

1. Advantages
   - Analytically tractable.
   - Simple implementation.
   - Uses local information, which results in highly adaptive behavior.
   - Its parallel implementation is very easy.
   - Nearly optimal in the large-sample limit (N → ∞): E(Bayes) ≤ E(NN) ≤ 2 × E(Bayes).
2. Disadvantages
   - Large storage requirements.
   - High computational cost during testing.
   - Highly susceptible to irrelevant features.
3. Large values of k
   - result in smoother decision boundaries,
   - provide more accurate probabilistic information.
4. But large values of k also
   - increase the computational cost,
   - destroy the locality of the estimation.
Distance-weighted nearest neighbor algorithms

1. One refinement of k-NN is to weight the contribution of each of the k neighbors according to its distance to the query point x.
2. For two-class classification,
   \[
   h(x) = \mathbb{I}\left[\sum_{i:\, t_i = 1} w_i > \sum_{i:\, t_i = 0} w_i\right],
   \]
   where
   \[
   w_i = \frac{1}{d(x, x_i)^2}.
   \]
3. For C-class classification,
   \[
   h(x) = \operatorname*{argmax}_{c \in C} \sum_{i=1}^{k} w_i \,\delta(c, t_i).
   \]
4. For regression,
   \[
   \hat{f}(x) = \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}.
   \]
   Sketches of the weighted regression and classification rules follow below.
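Possible sketches of the distance-weighted regression and C-class rules (NumPy). The function names are illustrative, and the small eps term is an added assumption to guard against division by zero when the query coincides with a training point; it is not part of the original formulas.

```python
import numpy as np

def weighted_knn_regress(X_train, f_train, x, k=3, eps=1e-12):
    """Distance-weighted k-NN regression: closer neighbors contribute more,
    with weights w_i = 1 / d(x, x_i)^2."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + eps)          # eps avoids division by zero
    return np.sum(w * f_train[nearest]) / np.sum(w)

def weighted_knn_classify(X_train, t_train, x, classes, k=3, eps=1e-12):
    """Distance-weighted k-NN classification: argmax over classes of the
    summed weights of the neighbors belonging to each class."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + eps)
    scores = {c: np.sum(w[t_train[nearest] == c]) for c in classes}
    return max(scores, key=scores.get)

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
t_train = np.array([0, 0, 1, 1])
print(weighted_knn_classify(X_train, t_train, np.array([0.2, 0.1]), classes=[0, 1], k=3))  # -> 0
```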
Locally weighted regression

1. In locally weighted regression (LWR), we use a linear model for the local approximation f̂:
   \[
   \hat{f}(x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D.
   \]
2. Suppose we aim to minimize the total squared error
   \[
   E = \frac{1}{2} \sum_{x \in S} \left(f(x) - \hat{f}(x)\right)^2.
   \]
3. Using gradient descent,
   \[
   \Delta w_j = \eta \sum_{x \in S} \left(f(x) - \hat{f}(x)\right) x_j,
   \]
   where η is a small number (the learning rate). A sketch of this global fit is given after the list.
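A sketch of this global gradient descent fit (NumPy). The function name fit_linear_gd, the learning rate, the iteration count, and the toy data are assumptions for illustration.

```python
import numpy as np

def fit_linear_gd(X, f, eta=0.01, n_iters=1000):
    """Fit f_hat(x) = w0 + w1*x1 + ... + wD*xD by batch gradient descent on
    the global squared error E = 0.5 * sum_{x in S} (f(x) - f_hat(x))^2."""
    N, D = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])   # prepend x0 = 1 so that w0 acts as the bias term
    w = np.zeros(D + 1)
    for _ in range(n_iters):
        residual = f - Xb @ w              # f(x) - f_hat(x) for every training example
        w += eta * Xb.T @ residual         # Delta w_j = eta * sum_x residual(x) * x_j
    return w

# Toy usage: recover a known linear target function
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 2))
f = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
print(fit_linear_gd(X, f))                 # ≈ [1.0, 2.0, -3.0]
```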
Locally weighted regression I

1. How shall we modify this procedure to derive a local approximation rather than a global one?
2. A simple way is to redefine the error criterion E to emphasize fitting the local training examples.
3. Three possible criteria are given below. Note that we write the error E(x_q) to emphasize that the error is now defined as a function of the query point x_q.
   1. Minimize the squared error over just the k nearest neighbors:
      \[
      E_1(x_q) = \frac{1}{2} \sum_{x \in KNN(x_q)} \left(f(x) - \hat{f}(x)\right)^2.
      \]
   2. Minimize the squared error over the whole set S of training examples, while weighting the error of each training example by some decreasing function K of its distance from x_q:
      \[
      E_2(x_q) = \frac{1}{2} \sum_{x \in S} \left(f(x) - \hat{f}(x)\right)^2 K\!\left(d(x_q, x)\right).
      \]
   3. Combine 1 and 2:
      \[
      E_3(x_q) = \frac{1}{2} \sum_{x \in KNN(x_q)} \left(f(x) - \hat{f}(x)\right)^2 K\!\left(d(x_q, x)\right).
      \]
Locally weighted regression II

1. If we choose criterion (3) above and re-derive the gradient descent rule, we obtain
   \[
   \Delta w_j = \eta \sum_{x \in KNN(x_q)} K\!\left(d(x_q, x)\right)\left(f(x) - \hat{f}(x)\right) x_j,
   \]
   where η is a small number (the learning rate). A sketch of this local update is given after the list.
2. Criterion (2) is perhaps the most aesthetically pleasing because it allows every training example to have an impact on the classification of x_q.
3. However, this approach requires computation that grows linearly with the number of training examples.
4. Criterion (3) is a good approximation to criterion (2) and has the advantage that its computational cost is independent of the total number of training examples; it depends only on the number k of neighbors considered.
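A sketch of locally weighted regression under criterion (3), assuming a Gaussian kernel K(d) = exp(-d^2 / (2 tau^2)). The function name lwr_predict, the kernel choice, all hyperparameter values, and the toy data are assumptions for illustration.

```python
import numpy as np

def lwr_predict(X, f, x_q, k=10, tau=0.5, eta=0.01, n_iters=2000):
    """Locally weighted linear regression at a query point x_q using criterion (3):
    squared error over the k nearest neighbors of x_q, each weighted by a
    Gaussian kernel of its distance to x_q."""
    dists = np.linalg.norm(X - x_q, axis=1)
    nearest = np.argsort(dists)[:k]
    K = np.exp(-dists[nearest] ** 2 / (2.0 * tau ** 2))   # kernel weights K(d(x_q, x))
    Xb = np.hstack([np.ones((k, 1)), X[nearest]])         # prepend x0 = 1 for the bias w0
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        residual = f[nearest] - Xb @ w
        w += eta * Xb.T @ (K * residual)   # Delta w_j = eta * sum K(d) * residual * x_j
    return np.concatenate(([1.0], x_q)) @ w               # evaluate the local model at x_q

# Toy usage on a one-dimensional nonlinear function
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0 * np.pi, size=(200, 1))
f = np.sin(X).ravel()
print(lwr_predict(X, f, np.array([1.0])))                 # ≈ sin(1.0) ≈ 0.84
```

Because the local fit uses only the k neighbors and their kernel weights, the per-query cost does not depend on the total number of training examples, matching the advantage claimed for criterion (3) above.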
Finding KNN(x) efficiently