Linear classifiers - CE-717: Machine Learning, Sharif University of Technology (PowerPoint PPT presentation)


  1. Linear classifiers. CE-717: Machine Learning, Sharif University of Technology. M. Soleymani, Fall 2016

  2. Topics
  - Discriminant functions
  - Linear classifiers
  - Perceptron
  - Fisher's linear discriminant
  - Multi-class classification
  (SVM will be covered in later lectures.)

  3. Classification problem
  - Given: a training set of $N$ labeled input-output pairs $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, with $y \in \{1, \dots, K\}$
  - Goal: given an input $\mathbf{x}$, assign it to one of the $K$ classes
  - Examples:
    - Spam filtering
    - Handwritten digit recognition
    - ...

  4. Discriminant functions
  - A discriminant function can directly assign each vector $\mathbf{x}$ to a specific class $\mathcal{C}_k$
  - A popular way of representing a classifier
  - Many classification methods are based on discriminant functions
  - Assumption: the classes are taken to be disjoint
    - The input space is thereby divided into decision regions
    - The boundaries are called decision boundaries or decision surfaces

  5. Discriminant functions
  - A discriminant function $f_i(\mathbf{x})$ is defined for each class $\mathcal{C}_i$ ($i = 1, \dots, K$):
    - $\mathbf{x}$ is assigned to class $\mathcal{C}_i$ if $f_i(\mathbf{x}) > f_j(\mathbf{x})$ for all $j \neq i$
  - Thus, we can easily divide the feature space into $K$ decision regions:
    - $\forall \mathbf{x}:\; f_i(\mathbf{x}) > f_j(\mathbf{x})$ for all $j \neq i \;\Rightarrow\; \mathbf{x} \in \mathcal{R}_i$, where $\mathcal{R}_i$ is the region of the $i$-th class
  - Decision surfaces (boundaries) can also be found using discriminant functions
    - The boundary between $\mathcal{R}_i$ and $\mathcal{R}_j$, separating samples of these two categories: $f_i(\mathbf{x}) = f_j(\mathbf{x})$
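A minimal Python sketch of this argmax decision rule (the function name, the lambda discriminants, and the test point are illustrative, not from the slides):

```python
import numpy as np

def classify(discriminants, x):
    """Assign x to the class C_i whose discriminant value f_i(x) is largest."""
    scores = np.array([f(x) for f in discriminants])
    return int(np.argmax(scores))            # index of the winning class

# Three hypothetical linear discriminants f_i(x) = w_i^T x + w_i0:
fs = [lambda x: 2 * x[0] + 1,
      lambda x: x[0] + x[1],
      lambda x: -x[1] + 3]
print(classify(fs, np.array([1.0, 0.5])))    # -> 0 (f_0 = 3 is the largest)
```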

  6. Discriminant functions: two-category case
  - For a two-category problem, we only need to find a single function $f: \mathbb{R}^d \rightarrow \mathbb{R}$:
    - $f_1(\mathbf{x}) = f(\mathbf{x})$
    - $f_2(\mathbf{x}) = -f(\mathbf{x})$
  - Decision surface: $f(\mathbf{x}) = 0$
  - First we discuss the two-category classification problem, and later the multi-category case.
  - Binary classification: a target variable $y \in \{0, 1\}$ or $y \in \{-1, 1\}$

  7. Linear classifiers
  - Decision boundaries are linear in $\mathbf{x}$, or linear in some given set of functions of $\mathbf{x}$
  - Linearly separable data: data points that can be exactly classified by a linear decision surface
  - Why linear classifiers?
    - Even when they are not optimal, their simplicity works in their favor
    - They are relatively easy to compute
    - In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers

  8. Two-category case
  - $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x} + w_0 = w_0 + w_1 x_1 + \dots + w_d x_d$
    - $\mathbf{x} = [x_1\; x_2\; \dots\; x_d]$
    - $\mathbf{w} = [w_1\; w_2\; \dots\; w_d]$
    - $w_0$: bias
  - Decision rule: if $\mathbf{w}^T\mathbf{x} + w_0 \geq 0$ then $\mathcal{C}_1$, else $\mathcal{C}_2$
  - Decision surface (boundary): $\mathbf{w}^T\mathbf{x} + w_0 = 0$
    - $\mathbf{w}$ is orthogonal to every vector lying within the decision surface
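A runnable sketch of this two-class decision rule (the helper name and the sample points are mine; the weights correspond to the boundary $3 - \frac{3}{4}x_1 - x_2 = 0$ used in the example on the next slide, as I read it):

```python
import numpy as np

def predict(w, w0, X):
    """Classify rows of X with the linear discriminant f(x) = w^T x + w0:
    +1 for class C1 (f(x) >= 0) and -1 for class C2 (f(x) < 0)."""
    scores = X @ w + w0                  # f(x) for every sample
    return np.where(scores >= 0, 1, -1)

w, w0 = np.array([-0.75, -1.0]), 3.0     # boundary: 3 - 0.75*x1 - x2 = 0
X = np.array([[1.0, 1.0],                # below the line, f(x) = 1.25 -> C1
              [4.0, 3.0]])               # above the line, f(x) = -3.0 -> C2
print(predict(w, w0, X))                 # [ 1 -1]
```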

  9. Example
  - Decision boundary: $3 - \frac{3}{4}x_1 - x_2 = 0$
  - If $\mathbf{w}^T\mathbf{x} + w_0 \geq 0$ then $\mathcal{C}_1$, else $\mathcal{C}_2$
  [Figure: the boundary line in the $(x_1, x_2)$ plane, crossing the $x_2$-axis at 3 and the $x_1$-axis at 4.]

  10. Linear classifier: two-category case
  - The decision boundary is a $(d-1)$-dimensional hyperplane $H$ in the $d$-dimensional feature space
    - The orientation of $H$ is determined by the normal vector $[w_1, \dots, w_d]$
    - $w_0$ determines the location of the surface
  - The normal distance from the origin to the decision surface is $\frac{w_0}{\|\mathbf{w}\|}$
  - Writing $\mathbf{x} = \mathbf{x}_\perp + r \frac{\mathbf{w}}{\|\mathbf{w}\|}$ gives $r = \frac{\mathbf{w}^T\mathbf{x} + w_0}{\|\mathbf{w}\|}$
    - So $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$ gives a signed measure of the perpendicular distance $r$ of the point $\mathbf{x}$ from the decision surface (with $f(\mathbf{x}) = 0$ on the surface)
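As a quick numeric check of these formulas (the numbers are chosen here purely for illustration): take $\mathbf{w} = [3, 4]$ and $w_0 = -5$, so $\|\mathbf{w}\| = 5$. For the point $\mathbf{x} = [2, 1]$, $r = (\mathbf{w}^T\mathbf{x} + w_0)/\|\mathbf{w}\| = (6 + 4 - 5)/5 = 1$, i.e. one unit on the positive ($\mathcal{C}_1$) side of the hyperplane, while the origin has $r = w_0/\|\mathbf{w}\| = -1$, one unit on the negative side.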

  11. Linear boundary: geometry
  [Figure: the half-spaces $\mathbf{w}^T\mathbf{x} + w_0 > 0$ and $\mathbf{w}^T\mathbf{x} + w_0 < 0$ on either side of the boundary $\mathbf{w}^T\mathbf{x} + w_0 = 0$, with the normal vector $\mathbf{w}$ and the signed distance $\frac{\mathbf{w}^T\mathbf{x} + w_0}{\|\mathbf{w}\|}$.]

  12. Non-linear decision boundary
  - Choose non-linear features; the classifier is still linear in the parameters
  - Decision boundary: $-1 + x_1^2 + x_2^2 = 0$
  - $\boldsymbol{\phi}(\mathbf{x}) = [1,\, x_1,\, x_2,\, x_1^2,\, x_2^2,\, x_1 x_2]$, where $\mathbf{x} = [x_1, x_2]$
  - $\mathbf{w} = [w_0, w_1, \dots, w_5] = [-1, 0, 0, 1, 1, 0]$
  - If $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) \geq 0$ then $y = 1$, else $y = -1$
  [Figure: the resulting circular decision boundary of radius 1 in the $(x_1, x_2)$ plane.]
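A small Python sketch of this feature-map classifier, using the weight vector from the slide (the function names and the test points are mine):

```python
import numpy as np

def phi(x):
    """Fixed non-linear feature map phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

# Weights from the slide: w^T phi(x) = -1 + x1^2 + x2^2,
# so the decision boundary is the unit circle, yet the model is linear in w.
w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])

for x in [np.array([0.2, 0.3]), np.array([1.0, 1.0])]:
    y = 1 if w @ phi(x) >= 0 else -1
    print(x, "->", y)                 # inside the circle -> -1, on/outside -> +1
```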

  13. Cost function for linear classification
  - Finding a linear classifier can be formulated as an optimization problem:
    - Select how to measure the prediction loss: based on the training set $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, a cost function $J(\mathbf{w})$ is defined
    - Solve the resulting optimization problem to find the parameters: find the optimal $\widehat{\mathbf{w}} = \underset{\mathbf{w}}{\operatorname{argmin}}\; J(\mathbf{w})$ and use $f(\mathbf{x}) = f(\mathbf{x}; \widehat{\mathbf{w}})$
  - Criteria (cost functions) for classification:
    - We will investigate several cost functions for the classification problem

  14. SSE cost function for classification ($K = 2$)
  - $J(\mathbf{w}) = \sum_{i=1}^{N} \left(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\right)^2$
  - The SSE cost function is not suitable for classification:
    - The least-squares loss penalizes "too correct" predictions (those that lie a long way on the correct side of the decision boundary)
    - The least-squares loss also lacks robustness to noise
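For a concrete (made-up) instance of the first problem: if $y = 1$ and the model outputs $\mathbf{w}^T\mathbf{x} = 5$, the point is far on the correct side of the boundary, yet SSE still charges $(5 - 1)^2 = 16$; a few such "too correct" points, or a few noisy outliers, can drag the least-squares boundary away from one that classifies every training sample correctly.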

  15. SSE cost function for classification ($K = 2$)
  [Figure from Bishop: the squared error $(\mathbf{w}^T\mathbf{x} - y)^2$ plotted against $\mathbf{w}^T\mathbf{x}$ for targets $y = 1$ and $y = -1$; correct predictions far from the target value are still penalized by SSE.]

  16. SSE cost function for classification ($K = 2$)
  - Is it more suitable if we set $f(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$?
  - $J(\mathbf{w}) = \sum_{i=1}^{N} \left(\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right)^2$, where $\operatorname{sign}(z) = \begin{cases} 1, & z \geq 0 \\ -1, & z < 0 \end{cases}$
  - $J(\mathbf{w})$ is now a piecewise-constant function proportional to the number of misclassifications, i.e. the training error incurred in classifying the training samples
  [Figure: $(\operatorname{sign}(\mathbf{w}^T\mathbf{x}) - y)^2$ against $\mathbf{w}^T\mathbf{x}$ for $y = 1$, and the resulting piecewise-constant cost $J(\mathbf{w})$.]

  17. Perceptron algorithm
  - Linear classifier
  - Two-class: $y \in \{-1, 1\}$
    - $y = 1$ for $\mathcal{C}_1$, $y = -1$ for $\mathcal{C}_2$
  - Goal:
    - $\forall i,\; \mathbf{x}^{(i)} \in \mathcal{C}_1 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} > 0$
    - $\forall i,\; \mathbf{x}^{(i)} \in \mathcal{C}_2 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} < 0$
  - $f(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$

  18. Perceptron criterion
  - $J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{w}^T\mathbf{x}^{(i)} y^{(i)}$
    - $\mathcal{M}$: the subset of training data that are misclassified
  - Many solutions? Which solution among them?

  19. Cost function
  [Figure from Duda, Hart, and Stork (2002): the number of misclassifications and the perceptron criterion $J_P(\mathbf{w})$, each plotted as a cost function over $(w_0, w_1)$; there may be many solutions under these cost functions.]

  20. Batch perceptron
  - "Gradient descent" to solve the optimization problem:
    - $\mathbf{w}^{t+1} = \mathbf{w}^{t} - \eta \nabla_{\mathbf{w}} J_P(\mathbf{w}^{t})$
    - $\nabla_{\mathbf{w}} J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
  - The batch perceptron converges in a finite number of steps for linearly separable data:
    - Initialize $\mathbf{w}$
    - Repeat: $\mathbf{w} = \mathbf{w} + \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
    - Until $\left\| \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)} \right\| < \theta$
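A runnable sketch of this batch update (the toy data, the stopping tolerance, and the choice to absorb the bias as a leading 1 in each sample are my own):

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, theta=1e-6, max_iter=1000):
    """Batch perceptron: gradient descent on J_P(w) = -sum_{i in M} w^T x_i y_i.
    X is (N, d+1) with a leading 1 column (bias absorbed into w); y is in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        M = (X @ w) * y <= 0                        # misclassified samples
        step = eta * (X[M] * y[M, None]).sum(axis=0)
        if np.linalg.norm(step) < theta:            # stop when the update is tiny
            break
        w = w + step                                # w <- w + eta * sum_{i in M} x_i y_i
    return w

# Toy linearly separable data (illustrative):
X = np.array([[1, 2.0, 2.0], [1, 1.0, 3.0], [1, -1.0, -2.0], [1, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = batch_perceptron(X, y)
print(np.sign(X @ w))          # matches y once a separating w is found
```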

  21. Stochastic gradient descent for the perceptron
  - Single-sample perceptron:
    - If $\mathbf{x}^{(i)}$ is misclassified: $\mathbf{w}^{t+1} = \mathbf{w}^{t} + \eta\, \mathbf{x}^{(i)} y^{(i)}$
  - Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps
  - Fixed-increment single-sample perceptron ($\eta$ can be set to 1 and the proof still works):
    - Initialize $\mathbf{w}$, $t \leftarrow 0$
    - Repeat:
      - $t \leftarrow t + 1$
      - $i \leftarrow t \bmod N$
      - if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
    - Until all patterns are properly classified
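The fixed-increment single-sample version, transcribed into Python (organized into epochs for readability, with an epoch cap of my own; the slide's loop only terminates when the data are linearly separable):

```python
import numpy as np

def single_sample_perceptron(X, y, max_epochs=100):
    """Fixed-increment single-sample perceptron (eta = 1): cycle through the
    samples and, whenever x^(i) is misclassified (y_i * w^T x_i <= 0),
    update w <- w + x^(i) y^(i). Converges for linearly separable data."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            if y[i] * (w @ X[i]) <= 0:     # x^(i) is misclassified
                w = w + y[i] * X[i]
                errors += 1
        if errors == 0:                    # all patterns properly classified
            break
    return w
```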

  22. Example

  23. Perceptron: example
  - Change $\mathbf{w}$ in a direction that corrects the error [Bishop]

  24. Convergence of the perceptron [Duda, Hart, and Stork, 2002]
  - For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge

  25. Pocket algorithm
  - For data that are not linearly separable due to noise:
    - Keeps in its pocket the best $\mathbf{w}$ encountered up to now
  - Algorithm:
    - Initialize $\mathbf{w}$
    - for $t = 1, \dots, T$:
      - $i \leftarrow t \bmod N$
      - if $\mathbf{x}^{(i)}$ is misclassified then
        - $\mathbf{w}_{new} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
        - if $E_{train}(\mathbf{w}_{new}) < E_{train}(\mathbf{w})$ then $\mathbf{w} = \mathbf{w}_{new}$
    - end
  - $E_{train}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\!\left[\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(n)}) \neq y^{(n)}\right]$
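A direct Python transcription of this pseudocode as reconstructed above (the toy interface is mine); note that in this variant a perceptron update is only accepted when it lowers $E_{train}$, which is how the best weights found so far stay "in the pocket":

```python
import numpy as np

def pocket(X, y, T=1000):
    """Pocket-style perceptron: a perceptron update w_new = w + x_i y_i is
    accepted only if it lowers the training error E_train, so the best
    weight vector seen so far is kept. Useful when the data are not
    linearly separable (e.g. because of label noise)."""
    def train_error(w):
        # fraction of samples with sign(w^T x) != y
        return np.mean(np.where(X @ w >= 0, 1, -1) != y)

    N, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        i = t % N
        if y[i] * (w @ X[i]) <= 0:                  # x^(i) is misclassified
            w_new = w + y[i] * X[i]
            if train_error(w_new) < train_error(w):
                w = w_new
    return w
```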

  26. Linear Discriminant Analysis (LDA)
  - Fisher's Linear Discriminant Analysis:
    - Dimensionality reduction: finds linear combinations of features with large ratios of between-group scatter to within-group scatter (used as new discriminant variables)
    - Classification: predicts the class of an observation $\mathbf{x}$ by first projecting it onto the space of discriminant variables and then classifying it in that space

  27. Good projection for classification
  - What is a good criterion?
    - Separating the different classes in the projected space

  28. Good projection for classification
  - What is a good criterion?
    - Separating the different classes in the projected space

  29. Good projection for classification
  - What is a good criterion?
    - Separating the different classes in the projected space
  [Figure: a candidate projection direction $\mathbf{w}$.]
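This excerpt stops before deriving the projection that optimizes this criterion, but for two classes the standard Fisher direction is $\mathbf{w} \propto S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$, where $\mathbf{m}_k$ are the class means and $S_W$ is the within-class scatter matrix. A minimal sketch (interface and names are mine, and $S_W$ is assumed invertible):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher LDA direction: w proportional to S_W^{-1} (m1 - m2),
    the projection with the largest between-class to within-class scatter ratio."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_W, m1 - m2)
    return w / np.linalg.norm(w)

# Projected 1-D values X @ w can then be thresholded to classify new points.
```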
