10-701 Machine Learning: Classification. Related reading: Mitchell 8.1, 8.2; Bishop 1.5
Where we are
• Inputs → Density Estimator → Probability (covered ✓)
• Inputs → Classifier → Predicted category (today)
• Inputs → Regressor → Predicted real number (later)
Classification • Assume we want to teach a computer to distinguish between cats and dogs …
Bayes decision rule
x – input feature set, y – label
• If we know the conditional probability p(x | y) and the class priors p(y), we can determine the appropriate class by using Bayes rule:
$P(y=i \mid x) = \frac{P(x \mid y=i)\, P(y=i)}{P(x)} \overset{\text{def}}{=} q_i(x)$
• We can use q_i(x) to select the appropriate class: we choose class 0 if $q_0(x) \geq q_1(x)$ and class 1 otherwise. Note that P(x) does not affect our decision.
• This is termed the 'Bayes decision rule' and leads to optimal classification: it minimizes our probability of making a mistake.
• However, it is often very hard to compute …
Bayes decision rule
$P(y=i \mid x) = \frac{P(x \mid y=i)\, P(y=i)}{P(x)} \overset{\text{def}}{=} q_i(x)$
• We can also use the resulting probabilities to determine our confidence in the class assignment by looking at the ratio $L(x) = \frac{q_0(x)}{q_1(x)}$, also known as the likelihood ratio; we will talk more about this later.
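A minimal Python sketch of the two-class Bayes decision rule and the confidence ratio, assuming 1-D Gaussian class-conditionals; the means, variances, and priors below are illustrative only, not from the slides.

```python
import numpy as np
from scipy.stats import norm

# Assumed 1-D Gaussian class-conditionals p(x | y) and class priors p(y);
# these particular parameters are made up for illustration.
priors = {0: 0.6, 1: 0.4}
cond = {0: norm(loc=0.0, scale=1.0), 1: norm(loc=2.0, scale=1.0)}

def posteriors(x):
    """q_i(x) = p(x | y=i) p(y=i) / p(x) for i = 0, 1."""
    joint = {i: cond[i].pdf(x) * priors[i] for i in (0, 1)}
    p_x = joint[0] + joint[1]                  # marginal p(x)
    return {i: joint[i] / p_x for i in (0, 1)}

def bayes_decide(x):
    """Choose class 0 if q_0(x) >= q_1(x), class 1 otherwise."""
    q = posteriors(x)
    return 0 if q[0] >= q[1] else 1

def confidence_ratio(x):
    """L(x) = q_0(x) / q_1(x): how strongly the evidence favors class 0."""
    q = posteriors(x)
    return q[0] / q[1]

print(bayes_decide(0.5), confidence_ratio(0.5))   # should favor class 0
print(bayes_decide(2.5), confidence_ratio(2.5))   # should favor class 1
```

Note that, as on the slide, the marginal p(x) cancels out of the decision itself; it only matters for reporting calibrated posteriors and the confidence ratio.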
Confidence: Example
[Figure: two panels, each titled 'Normal Gaussians', showing class-conditional Gaussians over the features x_1 and x_2.]
Bayes error and risk
[Figure: P(Y|X) with the curves P_0(X)P(Y=0) and P_1(X)P(Y=1); the region where the curves overlap marks the x values for which we will have errors.]
• The risk for sample x is defined as: $R(x) = \min\{P_1(x)P(y=1),\ P_0(x)P(y=0)\} / P(x)$
• Risk can be used to determine a 'reject' region.
Bayes risk
[Figure: P(Y|X) with the curves P_0(X)P(Y=0) and P_1(X)P(Y=1).]
• The probability that we assign a sample to the wrong class is known as the risk.
• The risk for sample x is: $R(x) = \min\{P_1(x)P(y=1),\ P_0(x)P(y=0)\} / P(x)$
• We can also compute the expected risk (the risk for the entire range of values of x):
$E[r(x)] = \int_x r(x)\, p(x)\, dx = \int_x \min\{p_1(x)p(y=1),\ p_0(x)p(y=0)\}\, dx = p(y=0)\int_{L_1} p_0(x)\, dx + p(y=1)\int_{L_0} p_1(x)\, dx$
where $L_1$ is the region where we assign instances to class 1 and $L_0$ the region where we assign instances to class 0.
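As a quick illustration (the numbers are made up for this example, not from the slides): suppose that at a particular point $x$ we have $p_0(x)P(y=0) = 0.20$ and $p_1(x)P(y=1) = 0.05$, so $p(x) = 0.25$. Then

$$R(x) = \frac{\min\{0.05,\ 0.20\}}{0.25} = 0.2,$$

i.e. the Bayes decision rule assigns $x$ to class 0 and is wrong 20% of the time at that point.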
Loss function
• The risk value we computed assumes that both errors (assigning instances of class 1 to class 0 and vice versa) are equally harmful.
• However, this is not always the case. Why?
• In general our goal is to minimize the loss, often defined by a loss function $L_{0,1}$, which is the penalty we pay when assigning instances of class 0 to class 1:
$E[L] = L_{0,1}\, p(y=0) \int_{L_1} p_0(x)\, dx + L_{1,0}\, p(y=1) \int_{L_0} p_1(x)\, dx$
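The decision rule that minimizes this expected loss (stated here for completeness; it follows directly from the definition above) weighs each class by the cost of the corresponding error: assign $x$ to class 1 when

$$L_{1,0}\, p_1(x)\, p(y=1) \;>\; L_{0,1}\, p_0(x)\, p(y=0),$$

and to class 0 otherwise. With $L_{0,1} = L_{1,0}$ this reduces to the Bayes decision rule from earlier.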
Types of classifiers
• We can divide the large variety of classification approaches into roughly three main types:
1. Instance based classifiers – use the observations directly (no models) – e.g. K nearest neighbors
2. Generative – build a generative statistical model – e.g. Naïve Bayes
3. Discriminative – directly estimate a decision rule/boundary – e.g. decision trees
Classification • Assume we want to teach a computer to distinguish between cats and dogs … Several steps: 1. feature transformation 2. Model / classifier specification 3. Model / classifier estimation (with regularization) 4. feature selection
Classification • Assume we want to teach a computer to distinguish between cats and dogs … Several steps: 1. feature transformation 2. Model / classifier specification 3. Model / classifier estimation (with regularization) 4. feature selection How do we encode the picture? A collection of pixels? Do we use the entire image or a subset? …
Classification • Assume we want to teach a computer to distinguish between cats and dogs … Several steps: 1. feature transformation 2. Model / classifier specification 3. Model / classifier estimation (with regularization) 4. feature selection What type of classifier should we use?
Classification • Assume we want to teach a computer to distinguish between cats and dogs … Several steps: 1. feature transformation 2. Model / classifier specification 3. Model / classifier estimation (with regularization) 4. feature selection How do we learn the parameters of our classifier? Do we have enough examples to learn a good model?
Classification • Assume we want to teach a computer to distinguish between cats and dogs … Several steps: 1. feature transformation 2. Model / classifier specification 3. Model / classifier estimation (with regularization) 4. feature selection Do we really need all the features? Can we use a smaller number and still achieve the same (or better) results?
Supervised learning
• Classification is one of the key components of 'supervised learning'.
• Unlike other learning paradigms, in supervised learning the teacher (us) provides the algorithm with the solutions (labels Y) for some of the instances X, and the goal is to generalize so that the resulting model can be used to determine the labels of unobserved samples.
[Figure: the teacher supplies labeled pairs (X, Y); the classifier learns parameters w_1, w_2, … and then maps new inputs X to predicted labels Y.]
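A minimal sketch of this setup in Python; the toy classifier and data below are illustrative only (not a method from the lecture), but any supervised learner exposes the same two steps: fit on teacher-provided labeled pairs, then predict on unobserved inputs.

```python
import numpy as np

class ThresholdClassifier:
    """Toy supervised learner: fit() estimates a parameter from the
    teacher-provided labeled pairs (X, y); predict() labels new samples.
    (Illustrative only.)"""

    def fit(self, X, y):
        # Learn one parameter: the midpoint between the two class means.
        self.w_ = (X[y == 0].mean() + X[y == 1].mean()) / 2.0
        return self

    def predict(self, X):
        return (X > self.w_).astype(int)

# The teacher provides solutions for some of the instances ...
X_train = np.array([0.1, 0.4, 0.5, 1.8, 2.1, 2.4])
y_train = np.array([0, 0, 0, 1, 1, 1])

# ... and the learned model generalizes to unobserved samples.
model = ThresholdClassifier().fit(X_train, y_train)
print(model.predict(np.array([0.3, 2.0])))   # expected: [0 1]
```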
Types of classifiers
• We can divide the large variety of classification approaches into roughly three main types:
1. Instance based classifiers – use the observations directly (no models) – e.g. K nearest neighbors
2. Generative – build a generative statistical model – e.g. Bayesian networks
3. Discriminative – directly estimate a decision rule/boundary – e.g. decision trees
K nearest neighbors
K nearest neighbors (KNN)
• A simple, yet surprisingly efficient, algorithm
• Requires the definition of a distance function or similarity measure between samples
• Select the class based on the majority vote among the k points closest to the query point
K nearest neighbors (KNN)
• Need to determine an appropriate value for k
• What happens if we choose k=1?
• What if k=3?
K nearest neighbors (KNN)
• The choice of k influences the 'smoothness' of the resulting classifier
• In that sense it is similar to kernel methods (discussed later in the course)
• However, here the smoothness of the function is determined by the actual distribution of the data (p(x)) and not by a predefined parameter
The effect of increasing k
The effect of increasing k
We will be using the Euclidean distance to determine the k nearest neighbors:
$d(x, x') = \sqrt{\sum_i (x_i - x_i')^2}$
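A minimal KNN sketch in Python using this distance and a majority vote; the tie-breaking preference order mirrors the one used in the slides that follow (Red, Green, Blue), and the 2-D data here is only a placeholder.

```python
import numpy as np
from collections import Counter

# Tie-breaking preference order used in the slides that follow.
TIE_ORDER = ["Red", "Green", "Blue"]

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distances d(x, x') = sqrt(sum_i (x_i - x'_i)^2)
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    best = max(votes.values())
    # Break ties using the fixed preference order.
    return next(c for c in TIE_ORDER if votes.get(c, 0) == best)

# Placeholder 2-D training data (illustrative only).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_train = np.array(["Red", "Red", "Blue", "Blue", "Green"])

print(knn_predict(X_train, y_train, np.array([0.1, 0.0]), k=1))   # -> Red
print(knn_predict(X_train, y_train, np.array([1.0, 1.0]), k=3))   # -> Blue
```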
KNN with k=1
KNN with k=3. Ties are broken using the order: Red, Green, Blue
KNN with k=5. Ties are broken using the order: Red, Green, Blue
Comparison of different k's
[Figure: three panels showing the resulting classifications for k = 1, k = 3, and k = 5.]
A probabilistic interpretation of KNN • The decision rule of KNN can be viewed using a probabilistic interpretation • What KNN is trying to do is approximate the Bayes decision rule on a subset of the data • To do that we need to compute certain properties including the conditional probability of the data given the class (p(x|y)), the prior probability of each class (p(y)) and the marginal probability of the data (p(x)) • These properties would be computed for some small region around our sample and the size of that region will be dependent on the distribution of the test samples* * Remember this idea. We will return to it when discussing kernel functions
Computing probabilities for KNN
• Let V be the volume of the m-dimensional ball around z containing the k nearest neighbors of z (where m is the number of features).
Notation: z – new data point to classify; V – selected ball; P – probability that a random point falls in V; N – total number of samples; K – number of nearest neighbors; N_1 – total number of samples from class 1; K_1 – number of samples from class 1 among the K neighbors.
• Then we can write:
$p(x)V = P \approx \frac{K}{N} \;\Rightarrow\; p(x) = \frac{K}{NV}, \qquad p(x \mid y=1) = \frac{K_1}{N_1 V}, \qquad p(y=1) = \frac{N_1}{N}$
• Using Bayes rule we get:
$p(y=1 \mid z) = \frac{p(z \mid y=1)\, p(y=1)}{p(z)} = \frac{K_1}{K}$
Computing probabilities for KNN
(N – total number of samples; V – volume of the selected ball; K – number of nearest neighbors; N_1 – total number of samples from class 1; K_1 – number of samples from class 1 among the K neighbors)
• Using Bayes rule we get:
$p(y=1 \mid z) = \frac{p(z \mid y=1)\, p(y=1)}{p(z)} = \frac{K_1}{K}$
• Using the Bayes decision rule we will choose the class with the highest posterior probability, which in this case is the class with the largest number of samples among the K nearest neighbors.
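A short sketch of this estimate in Python (reusing the placeholder data from the KNN sketch above): the estimated posterior for each class is simply its share of the K nearest neighbors, and the Bayes decision rule then picks the class with the largest share.

```python
import numpy as np
from collections import Counter

def knn_posteriors(X_train, y_train, z, k=5):
    """Estimate p(y = c | z) as K_c / K, the fraction of the k nearest
    neighbors of z that belong to class c."""
    dists = np.sqrt(((X_train - z) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    counts = Counter(y_train[i] for i in nearest)
    return {c: counts[c] / k for c in counts}

# Placeholder data (same illustrative points as before).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_train = np.array(["Red", "Red", "Blue", "Blue", "Green"])

# Picking the class with the largest estimated posterior is the same as
# picking the class with the most samples among the K neighbors.
post = knn_posteriors(X_train, y_train, np.array([1.0, 1.0]), k=3)
print(post, max(post, key=post.get))   # e.g. Blue: 2/3, Green: 1/3 -> Blue
```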
Important points
• Optimal decision using Bayes rule
• Types of classifiers
• Effect of the value of k on KNN classifiers
• Probabilistic interpretation of KNN
• Possible reading: Mitchell, Chapters 1, 2 and 8.