CZECH TECHNICAL UNIVERSITY IN PRAGUE
Faculty of Electrical Engineering
Department of Cybernetics

Nearest neighbors. Kernel functions, SVM. Decision trees.

Petr Pošík
Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Cybernetics
© 2017
Nearest neighbors
Method of k nearest neighbors

■ Simple, non-parametric, instance-based method for supervised learning, applicable to both classification and regression.
■ Do not confuse k-NN with
  ■ k-means (a clustering algorithm),
  ■ NN (neural networks).
■ Training: just remember the whole training dataset T.
■ Prediction: to get the model prediction for a new data point x (the query), see the sketch below this list:
  ■ find the set N_k(x) of the k nearest neighbors of x in T using a certain distance measure,
  ■ in case of classification, determine the predicted class \hat{y} = h(x) as the majority vote among the nearest neighbors, i.e.
      \hat{y} = h(x) = \arg\max_y \sum_{(x', y') \in N_k(x)} I(y' = y),
    where I(P) is an indicator function (returns 1 if P is true, 0 otherwise),
  ■ in case of regression, determine the predicted value \hat{y} = h(x) as the average of the values y of the nearest neighbors, i.e.
      \hat{y} = h(x) = \frac{1}{k} \sum_{(x', y') \in N_k(x)} y'.
■ What is the influence of k on the final model?
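The prediction step above can be captured in a short sketch (not part of the original slides; a minimal illustration assuming NumPy arrays, Euclidean distance, and the hypothetical function name knn_predict):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict the label/value for a single query point with plain k-NN."""
    # Distances of the query to all training points (Euclidean distance assumed).
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest neighbors, i.e. the set N_k(x).
    nearest = np.argsort(dists)[:k]
    if task == "classification":
        # Majority vote among the neighbors' labels.
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Regression: plain average of the neighbors' target values.
    return y_train[nearest].mean()
```

For example, knn_predict(X, y, np.array([4.0, 6.0]), k=5) would return the majority class among the 5 training points closest to (4, 6).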
k-NN classification: Example

[Figure: six panels showing the decision regions of a k-NN classifier on a 2-D dataset for different values of k; both axes span 0–10.]

■ Only for 1-NN are all training examples classified correctly (unless there are two identical observations with different labels).
■ Unbalanced classes may be an issue: the more frequent class takes over with increasing k.
k-NN regression example

The training data:

[Figure: scatter plot of the training data for a 1-D regression problem.]
k-NN regression example

[Figure: k-NN regression fits of the training data for several values of k.]

■ For small k, the surface is rugged.
■ For large k, too much averaging (smoothing) takes place.
k-NN Summary

Comments:
■ For 1-NN, the division of the input space into convex cells is called a Voronoi tessellation.
■ A weighted variant can be constructed (see the sketch below):
  ■ Each of the k nearest neighbors has a weight inversely proportional to its distance to the query point.
  ■ Prediction is then done using weighted voting (in case of classification) or weighted averaging (in case of regression).
■ In regression tasks, instead of averaging you can use e.g. (weighted) linear regression to compute the prediction.

Advantages:
■ Simple and widely applicable method.
■ Works for both classification and regression tasks.
■ Works for both categorical and continuous predictors (independent variables).

Disadvantages:
■ Must store the whole training set (there are methods for training set reduction).
■ During prediction, it must compute the distances to all the training data points (can be alleviated e.g. by using a KD-tree structure for the training set).

Overfitting prevention:
■ Choose the right value of k, e.g. using cross-validation.
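The weighted variant described above can be sketched as follows (again a minimal illustration, not from the slides; inverse-distance weights and the name weighted_knn_predict are assumptions):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=3, task="classification", eps=1e-12):
    """Distance-weighted k-NN: each neighbor contributes with weight 1/distance."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)   # inverse-distance weights (eps avoids division by zero)
    if task == "classification":
        # Weighted voting: sum the weights per class, pick the class with the largest total.
        classes = np.unique(y_train[nearest])
        totals = [weights[y_train[nearest] == c].sum() for c in classes]
        return classes[int(np.argmax(totals))]
    # Weighted averaging for regression.
    return np.average(y_train[nearest], weights=weights)
```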
Support vector machine
Revision

Optimal separating hyperplane:
■ A way to find a linear classifier optimal in a certain sense by means of a quadratic program (dual task for the soft-margin version):

    maximize (w.r.t. \alpha_1, \dots, \alpha_{|T|}, \mu_1, \dots, \mu_{|T|})
      \sum_{i=1}^{|T|} \alpha_i - \frac{1}{2} \sum_{i=1}^{|T|} \sum_{j=1}^{|T|} \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)T} x^{(j)}
    subject to
      \alpha_i \geq 0, \quad \mu_i \geq 0, \quad \alpha_i + \mu_i = C, \quad \text{and} \quad \sum_{i=1}^{|T|} \alpha_i y^{(i)} = 0.

■ The parameters of the hyperplane are given in terms of a weighted linear combination of support vectors:

    w = \sum_{i=1}^{|T|} \alpha_i y^{(i)} x^{(i)}, \qquad w_0 = y^{(k)} - x^{(k)T} w.

Basis expansion:
■ Instead of a linear model ⟨w, x⟩, create a linear model of nonlinearly transformed features ⟨w', Φ(x)⟩, which represents a nonlinear model in the original space.

What if we put these two things together?
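A small sketch of the second point (not from the slides): assuming the dual variables alpha have already been obtained from some QP solver, the primal parameters (w, w_0) can be recovered like this; the function name and the tolerance are illustrative only.

```python
import numpy as np

def hyperplane_from_dual(X, y, alpha, C, tol=1e-8):
    """Recover the primal parameters (w, w_0) from a solved soft-margin dual.

    X: (n, d) training inputs, y: (n,) labels in {-1, +1},
    alpha: (n,) dual variables returned by a QP solver, C: box constraint.
    """
    # w is a weighted linear combination of support vectors
    # (points with alpha_i = 0 contribute nothing).
    w = (alpha * y) @ X
    # Pick any unbounded support vector (0 < alpha_k < C) to compute the bias
    # w_0 = y^(k) - x^(k)T w; such a point lies exactly on the margin.
    k = int(np.argmax((alpha > tol) & (alpha < C - tol)))
    w0 = y[k] - X[k] @ w
    return w, w0
```

A new point x is then classified as sign(w · x + w_0).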
Optimal separating hyperplane combined with the basis expansion

Using the optimal separating hyperplane, the examples x occur only in the form of dot products:

■ in the optimization criterion
    \sum_{i=1}^{|T|} \alpha_i - \frac{1}{2} \sum_{i=1}^{|T|} \sum_{j=1}^{|T|} \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)T} x^{(j)},
■ and in the decision rule
    f(x) = \mathrm{sign}\left( \sum_{i=1}^{|T|} \alpha_i y^{(i)} x^{(i)T} x + w_0 \right).
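To make the dot-product form of the decision rule concrete, a minimal sketch (not from the slides; the name svm_decision is an assumption):

```python
import numpy as np

def svm_decision(x, X_train, y_train, alpha, w0):
    """Decision rule in which training points enter only through dot products x^(i)T x."""
    dots = X_train @ x                       # all dot products x^(i)T x at once
    return np.sign(np.sum(alpha * y_train * dots) + w0)
```

Because the training points and the query meet only inside these dot products, this is exactly the place where a kernel function can later be substituted (the kernel trick).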