CZECH TECHNICAL UNIVERSITY IN PRAGUE
Faculty of Electrical Engineering
Department of Cybernetics

Nearest neighbors. Kernel functions, SVM. Decision trees.

Petr Pošík
Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Cybernetics
© 2017
Nearest neighbors
Method of k nearest neighbors

■ Simple, non-parametric, instance-based method for supervised learning, applicable to both classification and regression.
■ Do not confuse k-NN with
  ■ k-means (a clustering algorithm),
  ■ NN (neural networks).
■ Training: just remember the whole training dataset T.
■ Prediction: to get the model prediction for a new data point x (the query), see the sketch below this list:
  ■ find the set N_k(x) of the k nearest neighbors of x in T using a certain distance measure,
  ■ in case of classification, determine the predicted class \hat{y} = h(x) as the majority vote among the nearest neighbors, i.e.
      \hat{y} = h(x) = \arg\max_y \sum_{(x', y') \in N_k(x)} I(y' = y),
    where I(P) is an indicator function (returns 1 if P is true, 0 otherwise),
  ■ in case of regression, determine the predicted value \hat{y} = h(x) as the average of the values y of the nearest neighbors, i.e.
      \hat{y} = h(x) = \frac{1}{k} \sum_{(x', y') \in N_k(x)} y'.
■ What is the influence of k on the final model?
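The prediction step above can be captured in a short sketch (not part of the original slides; a minimal illustration assuming NumPy arrays, Euclidean distance, and the hypothetical function name knn_predict):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict the label/value for a single query point with plain k-NN."""
    # Distances of the query to all training points (Euclidean distance assumed).
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest neighbors, i.e. the set N_k(x).
    nearest = np.argsort(dists)[:k]
    if task == "classification":
        # Majority vote among the neighbors' labels.
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Regression: plain average of the neighbors' target values.
    return y_train[nearest].mean()
```

For example, knn_predict(X, y, np.array([4.0, 6.0]), k=5) would return the majority class among the 5 training points closest to (4, 6).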
k-NN classification: Example

[Figure: six panels showing the decision regions of a k-NN classifier on a 2-D dataset for different values of k; both axes span 0–10.]

■ Only for 1-NN are all training examples classified correctly (unless there are two identical observations with different labels).
■ Unbalanced classes may be an issue: the more frequent class takes over with increasing k.
k-NN regression example

The training data:

[Figure: scatter plot of the training data for a 1-D regression problem.]
k-NN regression example

[Figure: k-NN regression fits of the training data for several values of k.]

■ For small k, the surface is rugged.
■ For large k, too much averaging (smoothing) takes place.
k-NN Summary

Comments:
■ For 1-NN, the division of the input space into convex cells is called a Voronoi tessellation.
■ A weighted variant can be constructed (see the sketch below):
  ■ Each of the k nearest neighbors has a weight inversely proportional to its distance to the query point.
  ■ Prediction is then done using weighted voting (in case of classification) or weighted averaging (in case of regression).
■ In regression tasks, instead of averaging you can use e.g. (weighted) linear regression to compute the prediction.

Advantages:
■ Simple and widely applicable method.
■ Works for both classification and regression tasks.
■ Works for both categorical and continuous predictors (independent variables).

Disadvantages:
■ Must store the whole training set (there are methods for training set reduction).
■ During prediction, it must compute the distances to all the training data points (can be alleviated e.g. by using a KD-tree structure for the training set).

Overfitting prevention:
■ Choose the right value of k, e.g. using cross-validation.
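The weighted variant described above can be sketched as follows (again a minimal illustration, not from the slides; inverse-distance weights and the name weighted_knn_predict are assumptions):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=3, task="classification", eps=1e-12):
    """Distance-weighted k-NN: each neighbor contributes with weight 1/distance."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)   # inverse-distance weights (eps avoids division by zero)
    if task == "classification":
        # Weighted voting: sum the weights per class, pick the class with the largest total.
        classes = np.unique(y_train[nearest])
        totals = [weights[y_train[nearest] == c].sum() for c in classes]
        return classes[int(np.argmax(totals))]
    # Weighted averaging for regression.
    return np.average(y_train[nearest], weights=weights)
```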
Support vector machine
Revision

Optimal separating hyperplane:
■ A way to find a linear classifier optimal in a certain sense by means of a quadratic program (dual task for the soft-margin version):

    maximize (w.r.t. \alpha_1, \dots, \alpha_{|T|}, \mu_1, \dots, \mu_{|T|})
      \sum_{i=1}^{|T|} \alpha_i - \frac{1}{2} \sum_{i=1}^{|T|} \sum_{j=1}^{|T|} \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)T} x^{(j)}
    subject to
      \alpha_i \geq 0, \quad \mu_i \geq 0, \quad \alpha_i + \mu_i = C, \quad \text{and} \quad \sum_{i=1}^{|T|} \alpha_i y^{(i)} = 0.

■ The parameters of the hyperplane are given in terms of a weighted linear combination of support vectors:

    w = \sum_{i=1}^{|T|} \alpha_i y^{(i)} x^{(i)}, \qquad w_0 = y^{(k)} - x^{(k)T} w.

Basis expansion:
■ Instead of a linear model ⟨w, x⟩, create a linear model of nonlinearly transformed features ⟨w', Φ(x)⟩, which represents a nonlinear model in the original space.

What if we put these two things together?
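A small sketch of the second point (not from the slides): assuming the dual variables alpha have already been obtained from some QP solver, the primal parameters (w, w_0) can be recovered like this; the function name and the tolerance are illustrative only.

```python
import numpy as np

def hyperplane_from_dual(X, y, alpha, C, tol=1e-8):
    """Recover the primal parameters (w, w_0) from a solved soft-margin dual.

    X: (n, d) training inputs, y: (n,) labels in {-1, +1},
    alpha: (n,) dual variables returned by a QP solver, C: box constraint.
    """
    # w is a weighted linear combination of support vectors
    # (points with alpha_i = 0 contribute nothing).
    w = (alpha * y) @ X
    # Pick any unbounded support vector (0 < alpha_k < C) to compute the bias
    # w_0 = y^(k) - x^(k)T w; such a point lies exactly on the margin.
    k = int(np.argmax((alpha > tol) & (alpha < C - tol)))
    w0 = y[k] - X[k] @ w
    return w, w0
```

A new point x is then classified as sign(w · x + w_0).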
Optimal separating hyperplane combined with the basis expansion

Using the optimal separating hyperplane, the examples x occur only in the form of dot products:

■ in the optimization criterion
    \sum_{i=1}^{|T|} \alpha_i - \frac{1}{2} \sum_{i=1}^{|T|} \sum_{j=1}^{|T|} \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)T} x^{(j)},
■ and in the decision rule
    f(x) = \mathrm{sign}\left( \sum_{i=1}^{|T|} \alpha_i y^{(i)} x^{(i)T} x + w_0 \right).
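To make the dot-product form of the decision rule concrete, a minimal sketch (not from the slides; the name svm_decision is an assumption):

```python
import numpy as np

def svm_decision(x, X_train, y_train, alpha, w0):
    """Decision rule in which training points enter only through dot products x^(i)T x."""
    dots = X_train @ x                       # all dot products x^(i)T x at once
    return np.sign(np.sum(alpha * y_train * dots) + w0)
```

Because the training points and the query meet only inside these dot products, this is exactly the place where a kernel function can later be substituted (the kernel trick).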