Linear classifiers
CE-717: Machine Learning, Sharif University of Technology
M. Soleymani, Fall 2016
Topics
- Discriminant functions
- Linear classifiers
- Perceptron
- SVM (will be covered in later lectures)
- Fisher
- Multi-class classification
Classification problem
- Given: a training set, i.e., a labeled set of N input-output pairs D = {(x^(i), y^(i))}_{i=1}^{N}, with y ∈ {1, ..., K}
- Goal: given an input x, assign it to one of K classes
- Examples: spam filtering, handwritten digit recognition, ...
Discriminant functions
- A discriminant function can directly assign each vector x to a specific class k
- A popular way of representing a classifier
- Many classification methods are based on discriminant functions
- Assumption: the classes are taken to be disjoint
  - The input space is thereby divided into decision regions, whose boundaries are called decision boundaries or decision surfaces
Discriminant Functions
- A discriminant function g_i(x) for each class C_i (i = 1, ..., K): x is assigned to class C_i if
    g_i(x) > g_j(x)   ∀ j ≠ i
- Thus, we can easily divide the feature space into K decision regions:
    g_i(x) > g_j(x)  ∀ j ≠ i  ⇒  x ∈ R_i
  where R_i is the region of the i-th class
- Decision surfaces (boundaries) can also be found using discriminant functions
  - The boundary between R_i and R_j, separating samples of these two categories, is the set of points where g_i(x) = g_j(x)
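To make the decision rule concrete, here is a minimal sketch (not from the lecture; the three discriminant functions and the test point are made up for illustration) of classifying by the largest discriminant value:

```python
import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant value g_i(x) is largest."""
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores))          # index of the winning class / region R_i

# Three hypothetical linear discriminants g_i(x) = w_i^T x + w_i0
gs = [lambda x: 2 * x[0] + x[1] - 1,
      lambda x: -x[0] + 3 * x[1],
      lambda x: 0.5 * x[0] - x[1] + 2]
print(classify(np.array([1.0, 2.0]), gs))  # -> 1 (the second discriminant wins here)
```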
Discriminant Functions: Two-Category
- For a two-category problem, it suffices to find a single function g: R^d → R with
    g_1(x) = g(x),   g_2(x) = -g(x)
  Decision surface: g(x) = 0
- First, we explain the two-category classification problem and then discuss multi-category problems
- Binary classification: a target variable y ∈ {0, 1} or y ∈ {-1, 1}
Linear classifiers
- Decision boundaries are linear in x, or linear in some given set of functions of x
- Linearly separable data: data points that can be exactly classified by a linear decision surface
- Why linear classifiers?
  - Even when they are not optimal, their simplicity is an advantage: they are relatively easy to compute
  - In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers
Two Category
    g(x; w) = w^T x + w_0 = w_0 + w_1 x_1 + ... + w_d x_d
    x = [x_1 x_2 ... x_d]^T,   w = [w_1 w_2 ... w_d]^T,   w_0: bias
- Decision rule: if w^T x + w_0 ≥ 0 then C_1, else C_2
- Decision surface (boundary): w^T x + w_0 = 0
- w is orthogonal to every vector lying within the decision surface
Example
- Decision boundary: 3 - (3/4) x_1 - x_2 = 0
- If w^T x + w_0 ≥ 0 then C_1, else C_2
[Figure: the boundary plotted in the (x_1, x_2) plane, crossing the axes at x_1 = 4 and x_2 = 3]
Linear classifier: Two Category
- The decision boundary is a (d-1)-dimensional hyperplane H in the d-dimensional feature space
  - The orientation of H is determined by the normal vector [w_1, ..., w_d]
  - w_0 determines the location of the surface; the normal distance from the origin to the decision surface is |w_0| / ‖w‖
- Writing x = x_⊥ + r (w / ‖w‖), where x_⊥ is the projection of x onto the decision surface (g(x_⊥) = 0), gives
    r = (w^T x + w_0) / ‖w‖
- So g(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface
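To make the geometry concrete, here is a small sketch (the hyperplane and the test points are made up for illustration) that evaluates the signed distance r = (w^T x + w_0) / ‖w‖:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed perpendicular distance r = (w^T x + w_0) / ||w|| of x from the hyperplane.

    r > 0 means x lies on the side the normal vector w points to; r < 0 means the other side.
    """
    return (w @ x + w0) / np.linalg.norm(w)

w = np.array([3.0, 4.0])                              # normal vector, ||w|| = 5
w0 = -5.0
print(signed_distance(np.array([3.0, 4.0]), w, w0))   # (9 + 16 - 5) / 5 = 4.0
print(signed_distance(np.array([0.0, 0.0]), w, w0))   # origin: (0 - 5) / 5 = -1.0
```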
Linear boundary: geometry
[Figure: the hyperplane w^T x + w_0 = 0 with its normal vector w; the regions w^T x + w_0 > 0 and w^T x + w_0 < 0 lie on its two sides, and w^T x + w_0 is proportional to the signed distance from the boundary]
Non-linear decision boundary
- Choose non-linear features; the classifier is still linear in the parameters
- Example: decision boundary -1 + x_1^2 + x_2^2 = 0 (the unit circle)
    φ(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]
    w = [w_0, w_1, ..., w_5] = [-1, 0, 0, 1, 1, 0]
- If w^T φ(x) ≥ 0 then y = 1, else y = -1, where x = [x_1, x_2]
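A minimal sketch of this example, assuming a plain NumPy implementation (the feature map and weights are exactly the ones shown above):

```python
import numpy as np

def phi(x):
    """Non-linear feature map from the slide: phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])   # boundary: -1 + x1^2 + x2^2 = 0 (unit circle)

def predict(x):
    """Classifier that is linear in the feature space: sign(w^T phi(x))."""
    return 1 if w @ phi(x) >= 0 else -1

print(predict([0.2, 0.3]))   # inside the circle  -> -1
print(predict([1.5, 1.0]))   # outside the circle -> +1
```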
Cost Function for linear classification
- Finding a linear classifier can be formulated as an optimization problem:
  - Select how to measure the prediction loss: based on the training set D = {(x^(i), y^(i))}_{i=1}^{N}, a cost function J(w) is defined
  - Solve the resulting optimization problem to find the parameters:
      ŵ = argmin_w J(w),   and then g(x) = g(x; ŵ)
- Criterion or cost functions for classification: we will investigate several cost functions for the classification problem
SSE cost function for classification (K = 2)
    J(w) = Σ_{i=1}^{N} (w^T x^(i) - y^(i))^2
- The SSE cost function is not suitable for classification:
  - Least-squares loss penalizes "too correct" predictions (points that lie a long way on the correct side of the decision boundary)
  - Least-squares loss also lacks robustness to noise
SSE cost function for classification (K = 2)
[Figure (from Bishop): (w^T x - y)^2 plotted against w^T x for targets y = 1 and y = -1; predictions that are correct but lie far beyond the target value are still penalized by SSE]
SSE cost function for classification (K = 2)
- Is it more suitable if we set g(x; w) = sign(w^T x)?
    J(w) = Σ_{i=1}^{N} (sign(w^T x^(i)) - y^(i))^2
    sign(z) = 1 if z ≥ 0, -1 if z < 0
- J(w) is then a piecewise-constant function of w, proportional to the number of misclassifications (the training error incurred in classifying the training samples), so its gradient is zero almost everywhere and gradient-based optimization cannot be used
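A small sketch (with made-up data) of evaluating this cost as a misclassification count; note that with labels in {-1, +1} each mistake contributes 4 to J(w), so J(w) changes only in jumps:

```python
import numpy as np

def zero_one_errors(w, X, y):
    """Number of training misclassifications of g(x; w) = sign(w^T x).

    With labels in {-1, +1}, the SSE of sign(w^T x) equals 4 * (number of errors),
    so it is piecewise constant in w.
    """
    preds = np.where(X @ w >= 0, 1, -1)
    return int(np.sum(preds != y))

# Made-up data for illustration
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])
print(zero_one_errors(np.array([1.0, 1.0]), X, y))    # 0 misclassifications
print(zero_one_errors(np.array([-1.0, 0.0]), X, y))   # 4 misclassifications
```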
Perceptron algorithm
- Linear classifier, two-class: y ∈ {-1, 1}; y = 1 for C_1, y = -1 for C_2
    g(x; w) = sign(w^T x)
- Goal:
    ∀i, x^(i) ∈ C_1 ⇒ w^T x^(i) > 0
    ∀i, x^(i) ∈ C_2 ⇒ w^T x^(i) < 0
  (equivalently, ∀i: y^(i) w^T x^(i) > 0)
Perceptron criterion
    J_P(w) = - Σ_{i∈M} w^T x^(i) y^(i)
- M: subset of the training data that are misclassified
- There may be many solutions; which solution among them?
Cost function
[Figure (from Duda, Hart, and Stork, 2002): the number of misclassifications J(w) and the perceptron criterion J_P(w) plotted over the weight space (w_0, w_1); both cost functions may have many solutions]
Batch Perceptron
- "Gradient descent" to solve the optimization problem:
    w^(t+1) = w^(t) - η ∇_w J_P(w^(t))
    ∇_w J_P(w) = - Σ_{i∈M} x^(i) y^(i)
- The batch perceptron converges in a finite number of steps for linearly separable data:
    Initialize w
    Repeat
        w = w + η Σ_{i∈M} x^(i) y^(i)
    Until ‖η Σ_{i∈M} x^(i) y^(i)‖ < θ
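A minimal NumPy sketch of the batch update above (the learning rate, stopping tolerance, and toy data are assumptions for illustration, not from the lecture):

```python
import numpy as np

def batch_perceptron(X, y, eta=0.1, tol=1e-6, max_iter=1000):
    """Batch perceptron: w <- w + eta * sum_{i in M} x^(i) y^(i),
    where M is the set of currently misclassified samples.
    X has a leading column of ones, so w[0] plays the role of the bias."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        miscls = (X @ w) * y <= 0                      # misclassified: y * w^T x <= 0
        step = eta * (X[miscls] * y[miscls][:, None]).sum(axis=0)
        if np.linalg.norm(step) < tol:                 # stop when the update is small
            break
        w += step
    return w

# Toy linearly separable data (made up for illustration)
X = np.array([[1, 2.0, 2.0], [1, 1.5, 2.5], [1, -1.0, -1.5], [1, -2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = batch_perceptron(X, y)
print(np.sign(X @ w))   # should match y
```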
Stochastic gradient descent for Perceptron
- Single-sample perceptron: if x^(i) is misclassified,
    w^(t+1) = w^(t) + η x^(i) y^(i)
- Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps
- Fixed-increment single-sample perceptron (η can be set to 1 and the proof still works):
    Initialize w, t ← 0
    repeat
        t ← t + 1
        i ← t mod N
        if x^(i) is misclassified then w = w + x^(i) y^(i)
    until all patterns are properly classified
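A corresponding sketch of the fixed-increment single-sample rule (again with made-up separable data); it stops once a full pass over the data makes no update:

```python
import numpy as np

def fixed_increment_perceptron(X, y, max_epochs=100):
    """Fixed-increment single-sample perceptron (eta = 1):
    cycle through the samples in order and update w <- w + x^(i) y^(i) on each mistake.
    Terminates once a full pass makes no update (guaranteed for linearly separable data)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:      # misclassified (or on the boundary)
                w += yi * xi
                updated = True
        if not updated:
            break
    return w

X = np.array([[1, 2.0, 1.0], [1, 2.5, 2.0], [1, -1.0, -2.0], [1, -1.5, -0.5]])
y = np.array([1, 1, -1, -1])
print(fixed_increment_perceptron(X, y))
```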
Example
[Figure: a worked example of perceptron updates]
Perceptron: Example [Bishop]
- Change w in a direction that corrects the error
Convergence of Perceptron [Duda, Hart & Stork, 2002]
- For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge
Pocket algorithm
- For data that are not linearly separable (e.g., due to noise):
- Keeps "in its pocket" the best w encountered up to now:
    Initialize w
    for t = 1, ..., T
        i ← t mod N
        if x^(i) is misclassified then
            w_new = w + x^(i) y^(i)
            if E_train(w_new) < E_train(w) then w = w_new
    end
    E_train(w) = (1/N) Σ_{n=1}^{N} [sign(w^T x^(n)) ≠ y^(n)]
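A sketch of the pocket loop as written above, with made-up, non-separable toy data; E_train is the fraction of misclassified training samples:

```python
import numpy as np

def train_error(w, X, y):
    """Fraction of samples with sign(w^T x) != y (E_train on the slide)."""
    return float(np.mean(np.where(X @ w >= 0, 1, -1) != y))

def pocket(X, y, T=1000):
    """Pocket algorithm as sketched above: on a mistake, form the perceptron update
    w_new and keep it only if it lowers the training error. T is an illustrative choice."""
    N = X.shape[0]
    w = np.zeros(X.shape[1])
    for t in range(T):
        i = t % N                           # cycle through the data as on the slide
        if y[i] * (X[i] @ w) <= 0:          # x^(i) is misclassified
            w_new = w + y[i] * X[i]
            if train_error(w_new, X, y) < train_error(w, X, y):
                w = w_new
    return w

# Noisy (not linearly separable) toy data, made up for illustration
X = np.array([[1, 2.0, 2.0], [1, 1.0, 1.5], [1, -1.0, -1.0], [1, -2.0, -1.5], [1, 1.8, 1.9]])
y = np.array([1, 1, -1, -1, -1])            # the last point acts as label noise
w = pocket(X, y)
print(w, train_error(w, X, y))
```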
Linear Discriminant Analysis (LDA)
- Fisher's Linear Discriminant Analysis:
  - Dimensionality reduction: finds linear combinations of features with large ratios of between-group scatter to within-group scatter (as discriminant new variables)
  - Classification: predicts the class of an observation x by first projecting it onto the space of discriminant variables and then classifying it in this space
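As a hedged preview of the criterion developed in the following slides: for two classes, the standard Fisher direction maximizes the ratio of between-class to within-class scatter and has the closed form w ∝ S_W^{-1} (m_1 - m_2) (a textbook result, e.g., Bishop). A minimal sketch with synthetic data:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher discriminant direction w ∝ S_W^{-1} (m1 - m2),
    where S_W is the within-class scatter matrix."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m1 - m2)
    return w / np.linalg.norm(w)

# Synthetic two-class data, made up for illustration
rng = np.random.default_rng(0)
X1 = rng.normal([2, 2], 0.5, size=(50, 2))
X2 = rng.normal([0, 0], 0.5, size=(50, 2))
w = fisher_direction(X1, X2)
print(w)                                   # projection direction
print((X1 @ w).mean(), (X2 @ w).mean())    # projected class means are well separated
```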
Good Projection for Classification
- What is a good criterion?
  - Separating different classes in the projected space
[Figures: the same two-class data projected onto different directions w; a good projection direction separates the classes in the projected space]