Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2019
Topics
} Linear classifiers
} Perceptron (SVM will be covered in later lectures)
} Fisher
} Multi-class classification
Classification problem
} Given: training set, a labeled set of n input-output pairs D = {(x^(i), y^(i))}_{i=1}^n
} y ∈ {1, …, K}
} Goal: given an input x, assign it to one of K classes
} Examples:
} Spam filter
} Handwritten digit recognition
Linear classifiers
} Decision boundaries are linear in x, or linear in some given set of functions of x
} Linearly separable data: data points that can be exactly classified by a linear decision surface
} Why linear classifiers?
} Even when they are not optimal, their simplicity is an advantage
} They are relatively easy to compute
} In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers
Two Category
} g(x; w) = w^T x + w_0 = w_0 + w_1 x_1 + … + w_d x_d
} x = [x_1 x_2 … x_d]
} w = [w_1 w_2 … w_d]
} w_0: bias
} if w^T x + w_0 ≥ 0 then C_1, else C_2
} Decision surface (boundary): w^T x + w_0 = 0
} w is orthogonal to every vector lying within the decision surface
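A minimal sketch of this decision rule in NumPy; the particular weights, bias, and test points are illustrative assumptions (they happen to match the boundary used in the example on the next slide).

```python
import numpy as np

def predict(X, w, w0):
    """Two-category linear rule: C1 (encoded +1) if w^T x + w0 >= 0, else C2 (-1)."""
    scores = X @ w + w0
    return np.where(scores >= 0, 1, -1)

# Assumed parameters: boundary 3 - (3/4) x1 - x2 = 0
w = np.array([-0.75, -1.0])
w0 = 3.0
X = np.array([[1.0, 1.0], [4.0, 3.0]])
print(predict(X, w, w0))   # [ 1 -1]
```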
Example
} Decision boundary: 3 − (3/4) x_1 − x_2 = 0
} if w^T x + w_0 ≥ 0 then C_1, else C_2
[Figure: the boundary line plotted in the (x_1, x_2) plane]
Linear classifier: Two Category
} Decision boundary is a (d − 1)-dimensional hyperplane H in the d-dimensional feature space
} The orientation of H is determined by the normal vector [w_1, …, w_d]
} w_0 determines the location of the surface
} The perpendicular distance from the origin to the decision surface is |w_0| / ‖w‖
} Writing x = x_⊥ + r (w / ‖w‖), where x_⊥ is the orthogonal projection of x onto the surface, gives r = (w^T x + w_0) / ‖w‖
} g(x) = w^T x + w_0 gives a signed measure of the perpendicular distance r of the point x from the decision surface
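A small sketch of the signed-distance formula r = (w^T x + w_0)/‖w‖; the weight vector and the two points below are assumed values chosen only to illustrate the computation.

```python
import numpy as np

def signed_distance(X, w, w0):
    """Signed perpendicular distance r = (w^T x + w0) / ||w|| for each row of X."""
    return (X @ w + w0) / np.linalg.norm(w)

# Assumed values: same boundary as before, 3 - (3/4) x1 - x2 = 0
w = np.array([-0.75, -1.0])
w0 = 3.0
X = np.array([[0.0, 0.0], [4.0, 3.0]])
print(signed_distance(X, w, w0))   # [ 2.4 -2.4]: origin lies on the positive side
```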
Linear boundary: geometry
[Figure: the regions w^T x + w_0 > 0, w^T x + w_0 = 0, and w^T x + w_0 < 0, with the normal vector w drawn perpendicular to the boundary]
Non-linear decision boundary
} Choose non-linear features
} Classifier is still linear in the parameters w
} Example boundary: −1 + x_1^2 + x_2^2 = 0
} φ(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]
} w = [w_0, w_1, …, w_5] = [−1, 0, 0, 1, 1, 0]
} if w^T φ(x) ≥ 0 then y = 1, else y = −1
} x = [x_1, x_2]
[Figure: the circular boundary of radius 1 in the (x_1, x_2) plane]
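A sketch of classifying with the quadratic feature map from this slide; the two test points are assumptions, while w and φ follow the slide.

```python
import numpy as np

def phi(x):
    """Quadratic feature map phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

# Weights from the slide: boundary -1 + x1^2 + x2^2 = 0 (the unit circle)
w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])

for x in ([0.2, 0.3], [1.5, 0.0]):          # assumed test points
    y = 1 if w @ phi(x) >= 0 else -1
    print(x, "->", y)   # inside the circle -> -1, outside -> +1
```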
Cost Function for linear classification
} Finding linear classifiers can be formulated as an optimization problem:
} Select how to measure the prediction loss
} Based on the training set D = {(x^(i), y^(i))}_{i=1}^n, a cost function J(w) is defined
} Solve the resulting optimization problem to find the parameters of f(x) = g(x; w):
} Find the optimal ŵ, where ŵ = argmin_w J(w)
} Criterion or cost functions for classification:
} We will investigate several cost functions for the classification problem
SSE cost function for classification (K = 2)
} The SSE cost function is not suitable for classification:
} Least squares loss penalizes "too correct" predictions (those that lie a long way on the correct side of the decision boundary)
} Least squares loss also lacks robustness to noise
} J(w) = Σ_{i=1}^n (w^T x^(i) − y^(i))^2, with y ∈ {−1, +1}
SSE cost function for classification (K = 2)
[Figure: (w^T x − y)^2 plotted against w^T x for y = 1 and for y = −1; correct predictions that lie far on the correct side are still penalized by SSE] [Bishop]
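A tiny numerical illustration of the point above; the scores are assumed values of w^T x for three points whose true label is +1.

```python
import numpy as np

# Assumed scores w^T x for three samples with true label y = +1
scores = np.array([0.9, 5.0, -0.5])
y = np.ones(3)
print((scores - y) ** 2)   # [ 0.01 16.    2.25]
# The score 5.0 is "too correct" (far on the right side of the boundary),
# yet it receives the largest squared loss -- the weakness noted above.
```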
SSE cost function for classification (K = 2)
} Is it more suitable if we set f(x; w) = sign(w^T x)?
} J(w) = Σ_{i=1}^n (sign(w^T x^(i)) − y^(i))^2
} sign(z) = −1 if z < 0, and 1 if z ≥ 0
} J(w) is a piecewise-constant function: it counts the number of misclassifications (up to a constant factor), i.e., the training error incurred in classifying the training samples
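A short check of that claim on assumed toy data; each misclassified sample contributes (±2)^2 = 4 to J, so J is four times the error count and is flat almost everywhere as a function of w.

```python
import numpy as np

def sign_sse(w, X, y):
    """J(w) = sum_i (sign(w^T x_i) - y_i)^2, with sign(0) taken as +1."""
    s = np.where(X @ w >= 0, 1, -1)
    return np.sum((s - y) ** 2)

# Assumed toy data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0]])
y = np.array([1, -1, -1])
print(sign_sse(np.array([1.0, 0.0]), X, y))   # 4 -> exactly one misclassification
```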
SSE cost function (K = 2)
} Is it more suitable if we set f(x; w) = σ(w^T x)?
} J(w) = Σ_{i=1}^n (σ(w^T x^(i)) − y^(i))^2
} σ(z) = (1 − e^{−z}) / (1 + e^{−z})
} We will see later in this lecture that the cost function of the logistic regression method is more suitable than this cost function for the classification problem
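A sketch of this squashed SSE cost; the toy data and weights are assumptions. Note that the slide's σ(z) equals tanh(z/2), so it maps scores into (−1, 1) to match the ±1 labels.

```python
import numpy as np

def sigma(z):
    """sigma(z) = (1 - e^{-z}) / (1 + e^{-z}), i.e. tanh(z/2), with range (-1, 1)."""
    return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))

def squashed_sse(w, X, y):
    """J(w) = sum_i (sigma(w^T x_i) - y_i)^2 for labels y in {-1, +1}."""
    return np.sum((sigma(X @ w) - y) ** 2)

# Assumed toy data
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
print(squashed_sse(np.array([0.5, 0.5]), X, y))
```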
Perceptron algorithm
} Linear classifier
} Two-class: y ∈ {−1, 1}
} y = −1 for C_2, y = 1 for C_1
} Goal: ∀i, x^(i) ∈ C_1 ⟹ w^T x^(i) > 0, and ∀i, x^(i) ∈ C_2 ⟹ w^T x^(i) < 0
} f(x; w) = sign(w^T x)
Perceptron criterion
} J_P(w) = − Σ_{i∈M} w^T x^(i) y^(i)
} M: subset of training data that are misclassified
} Many solutions? Which solution among them?
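A minimal sketch of evaluating the Perceptron criterion J_P; the toy data and weight vector are assumptions.

```python
import numpy as np

def perceptron_criterion(w, X, y):
    """J_P(w) = - sum_{i in M} (w^T x_i) * y_i over the misclassified set M."""
    scores = X @ w
    M = np.sign(scores) != y                 # misclassified samples
    return -np.sum(scores[M] * y[M])

# Assumed toy data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0]])
y = np.array([1, -1, -1])
print(perceptron_criterion(np.array([1.0, 0.0]), X, y))   # 1.0: one error, so J_P > 0
```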
Cost functions
[Figure: the number-of-misclassifications cost J(w) (piecewise constant) and the Perceptron cost J_P(w), each plotted over (w_0, w_1); there may be many solutions minimizing these cost functions] [Duda, Hart, and Stork, 2002]
Batch Perceptron
} "Gradient descent" to solve the optimization problem:
} w^{t+1} = w^t − η ∇J_P(w^t)
} ∇J_P(w) = − Σ_{i∈M} x^(i) y^(i)
} Batch Perceptron converges in a finite number of steps for linearly separable data:
} Initialize w
} Repeat: w = w + η Σ_{i∈M} x^(i) y^(i)
} Until ‖η Σ_{i∈M} x^(i) y^(i)‖ < θ
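A runnable sketch of the batch update above on assumed toy data; the bias is handled by appending a constant-1 feature, and the learning rate η, threshold θ, and iteration cap are illustrative choices.

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, theta=1e-6, max_iter=1000):
    """Batch perceptron: w <- w + eta * sum of x_i*y_i over misclassified samples,
    repeated until the norm of the update falls below theta."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        M = np.sign(X @ w) != y                      # misclassified set
        update = eta * (X[M] * y[M, None]).sum(axis=0)
        if np.linalg.norm(update) < theta:
            break
        w = w + update
    return w

# Assumed linearly separable toy data (first column of 1s acts as the bias feature)
X = np.array([[1, 2, 2], [1, 1, 3], [1, -1, -1], [1, -2, 1.]])
y = np.array([1, 1, -1, -1])
print(batch_perceptron(X, y))
```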
Stochastic gradient descent for Perceptron
} Single-sample perceptron:
} If x^(i) is misclassified: w^{t+1} = w^t + η x^(i) y^(i)
} Perceptron convergence theorem: for linearly separable data
} If training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps
} Fixed-increment single-sample Perceptron (η can be set to 1 and the proof still works):
} Initialize w, t ← 0
} Repeat: t ← t + 1; i ← t mod n; if x^(i) is misclassified then w = w + x^(i) y^(i)
} Until all patterns are properly classified
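A sketch of the fixed-increment single-sample perceptron (η = 1); the toy data are assumed, and a pass cap replaces the unbounded loop for safety.

```python
import numpy as np

def single_sample_perceptron(X, y, max_passes=100):
    """Cycle through the samples and correct w on each misclassification,
    stopping once a full pass makes no mistakes."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_passes):
        errors = 0
        for i in range(n):
            if np.sign(w @ X[i]) != y[i]:    # x^(i) misclassified
                w = w + X[i] * y[i]
                errors += 1
        if errors == 0:                      # all patterns properly classified
            break
    return w

# Assumed toy data (first column of 1s acts as the bias feature)
X = np.array([[1, 2, 2], [1, 1, 3], [1, -1, -1], [1, -2, 1.]])
y = np.array([1, 1, -1, -1])
print(single_sample_perceptron(X, y))
```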
Example
Perceptron: Example
} Change w in a direction that corrects the error [Bishop]
Online vs. Batch Learning
} Batch Learning
} Learn from all the examples at once
} Online Learning
} Gradually learn as each example is received
} Email classification example
} Recommendation systems
} Examples: recommending movies; predicting whether a user will be interested in a new news article
Perceptron Convergence
} Assume there exists w* with ‖w*‖ = 1 and some γ > 0 such that y^(n) w*^T x^(n) ≥ γ for all n = 1, …, N
} Also assume ‖x^(n)‖ ≤ R for all n = 1, …, N
} Then the Perceptron makes at most R²/γ² errors
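A small empirical check of this mistake bound; the data and the separator w* below are assumptions (the bound itself is the standard R²/γ² guarantee for the η = 1 perceptron on separable data).

```python
import numpy as np

def perceptron_mistakes(X, y, max_passes=100):
    """Run the eta = 1 single-sample perceptron and count its mistakes."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_passes):
        changed = False
        for i in range(len(y)):
            if y[i] * (w @ X[i]) <= 0:       # misclassified (or on the boundary)
                w = w + y[i] * X[i]
                mistakes += 1
                changed = True
        if not changed:
            break
    return mistakes

# Assumed separable toy data and a known unit-norm separator w*
X = np.array([[2, 2], [1, 3], [-1, -1], [-2, 1.]])
y = np.array([1, 1, -1, -1])
w_star = np.array([1.0, 0.0])
gamma = np.min(y * (X @ w_star))             # margin of w* on the data
R = np.max(np.linalg.norm(X, axis=1))        # radius of the data
print(perceptron_mistakes(X, y), "<=", (R / gamma) ** 2)
```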
Perceptron Convergence (more general case)
} It can be shown that the number of mistakes remains bounded in terms of R = max_{(x,y)∈D} ‖x‖ and the margin γ = min_{(x,y)∈D} y w*^T x
Convergence of Perceptron [Duda, Hart & Stork, 2002]
} For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge
Pocket algorithm
} For data that are not linearly separable due to noise:
} Keeps in its pocket the best w encountered up to now
} Initialize w
} for t = 1, …, T:
}   i ← t mod n
}   if x^(i) is misclassified then w_new = w + x^(i) y^(i)
}   if E_train(w_new) < E_train(w) then w = w_new
} end
} E_train(w) = (1/n) Σ_{i=1}^n [sign(w^T x^(i)) ≠ y^(i)]
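A sketch of the pocket idea on assumed noisy, non-separable toy data; this version keeps a separate running perceptron weight vector and a pocketed best-so-far, which is one common formulation of the algorithm.

```python
import numpy as np

def train_error(w, X, y):
    """E_train(w) = fraction of samples with sign(w^T x) != y."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, T=200):
    """Perceptron updates, keeping ('pocketing') the lowest-training-error weights."""
    n, d = X.shape
    w = np.zeros(d)                                   # running perceptron weights
    best_w, best_err = w.copy(), train_error(w, X, y)
    for t in range(1, T + 1):
        i = t % n
        if np.sign(w @ X[i]) != y[i]:                 # x^(i) misclassified
            w = w + X[i] * y[i]
            err = train_error(w, X, y)
            if err < best_err:                        # better than the pocketed w
                best_w, best_err = w.copy(), err
    return best_w

# Assumed noisy toy data; the last label is flipped to make the set non-separable
X = np.array([[1, 2, 2], [1, 1, 3], [1, -1, -1], [1, -2, 1], [1, 2, 2.]])
y = np.array([1, 1, -1, -1, -1])
print(pocket(X, y))
```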
Linear Discriminant Analysis (LDA)
} Fisher's Linear Discriminant Analysis:
} Dimensionality reduction
} Finds linear combinations of features with large ratios of between-group scatter to within-group scatter (as new discriminant variables)
} Classification
} Predicts the class of an observation x by first projecting it onto the space of discriminant variables and then classifying it in this space
Good Projection for Classification
} What is a good criterion?
} Separating different classes in the projected space
[Figures: the same two-class data projected onto different candidate directions w, showing better and worse class separation]
LDA Problem
} Problem definition:
} K = 2 classes
} {(x^(i), y^(i))}_{i=1}^n training samples, with n_1 samples from the first class (C_1) and n_2 samples from the second class (C_2)
} Goal: find the best direction w that we hope will enable accurate classification
} The projection of a sample x onto a line in direction w is w^T x
} What is a good measure of the separation between the projected points of different classes?
Measure of Separation in the Projected Direction
} Is the direction of the line joining the class means a good candidate for w? [Bishop]