
Linear classifiers (CE-717: Machine Learning, Sharif University of Technology)



  1. Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2019

  2. Topics
     β€’ Linear classifiers
     β€’ Perceptron (SVM will be covered in later lectures)
     β€’ Fisher
     β€’ Multi-class classification

  3. Classification problem
     β€’ Given: training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, a labeled set of $N$ input-output pairs, with $y \in \{1, \dots, K\}$
     β€’ Goal: given an input $\mathbf{x}$, assign it to one of $K$ classes
     β€’ Examples: spam filtering, handwritten digit recognition

  4. Linear classifiers
     β€’ Decision boundaries are linear in $\mathbf{x}$, or linear in some given set of functions of $\mathbf{x}$
     β€’ Linearly separable data: data points that can be exactly classified by a linear decision surface
     β€’ Why linear classifiers?
       β€’ Even when they are not optimal, we can exploit their simplicity
       β€’ They are relatively easy to compute
       β€’ In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers

  5. Two-category case
     β€’ $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x} + w_0 = w_0 + w_1 x_1 + \dots + w_d x_d$
       β€’ $\mathbf{x} = [x_1\ x_2\ \dots\ x_d]$, $\mathbf{w} = [w_1\ w_2\ \dots\ w_d]$, $w_0$: bias
     β€’ Decision rule: if $\mathbf{w}^T \mathbf{x} + w_0 \ge 0$ then $\mathcal{C}_1$, else $\mathcal{C}_2$
     β€’ Decision surface (boundary): $\mathbf{w}^T \mathbf{x} + w_0 = 0$
     β€’ $\mathbf{w}$ is orthogonal to every vector lying within the decision surface

  6. Example
     β€’ Decision boundary: $3 - \tfrac{3}{4} x_1 - x_2 = 0$, plotted in the $(x_1, x_2)$ plane
     β€’ If $\mathbf{w}^T \mathbf{x} + w_0 \ge 0$ then $\mathcal{C}_1$, else $\mathcal{C}_2$
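As an illustration (not part of the original slides), a minimal Python sketch of this two-category decision rule, using the example boundary above; the function and variable names are my own:

```python
import numpy as np

def predict(X, w, w0):
    """Two-category linear classifier: class 1 if w^T x + w0 >= 0, else class 2."""
    scores = X @ w + w0
    return np.where(scores >= 0, 1, 2)

# Example boundary from the slide: 3 - (3/4) x1 - x2 = 0
w = np.array([-3/4, -1.0])
w0 = 3.0
X = np.array([[1.0, 1.0],   # below the line -> positive score -> class 1
              [4.0, 3.0]])  # above the line -> negative score -> class 2
print(predict(X, w, w0))    # [1 2]
```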

  7. Linear classifier: Two Category
     β€’ The decision boundary is a $(d-1)$-dimensional hyperplane $H$ in the $d$-dimensional feature space
     β€’ The orientation of $H$ is determined by the normal vector $[w_1, \dots, w_d]$; $w_0$ determines the location of the surface
     β€’ The normal distance from the origin to the decision surface is $\frac{|w_0|}{\lVert\mathbf{w}\rVert}$
     β€’ Writing $\mathbf{x} = \mathbf{x}_\perp + r\,\frac{\mathbf{w}}{\lVert\mathbf{w}\rVert}$, with $\mathbf{x}_\perp$ the projection of $\mathbf{x}$ onto the surface, gives $r = \frac{\mathbf{w}^T \mathbf{x} + w_0}{\lVert\mathbf{w}\rVert}$
     β€’ So $f(\mathbf{x})$ gives a signed measure of the perpendicular distance $r$ of the point $\mathbf{x}$ from the decision surface ($f(\mathbf{x}) = 0$ on the surface)

  8. Linear boundary: geometry
     [Figure: the regions $\mathbf{w}^T \mathbf{x} + w_0 > 0$, $\mathbf{w}^T \mathbf{x} + w_0 = 0$, and $\mathbf{w}^T \mathbf{x} + w_0 < 0$, with $\mathbf{w}$ normal to the boundary]
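A small companion sketch (again not from the slides) of the signed-distance formula $r = (\mathbf{w}^T\mathbf{x} + w_0)/\lVert\mathbf{w}\rVert$ from slide 7:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed perpendicular distance of point x from the hyperplane w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([-3/4, -1.0]), 3.0          # boundary from slide 6
print(signed_distance(np.array([1.0, 1.0]), w, w0))   # positive: class-1 side
print(signed_distance(np.array([4.0, 3.0]), w, w0))   # negative: class-2 side
```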

  9. Non-linear decision boundary
     β€’ Choose non-linear features; the classifier is still linear in the parameters
     β€’ Feature map: $\phi(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]$ with $\mathbf{x} = [x_1, x_2]$
     β€’ With $\mathbf{w} = [w_0, w_1, \dots, w_5] = [-1, 0, 0, 1, 1, 0]$, the boundary is $-1 + x_1^2 + x_2^2 = 0$ (the unit circle)
     β€’ Decision rule: if $\mathbf{w}^T \phi(\mathbf{x}) \ge 0$ then $y = 1$, else $y = -1$
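A brief sketch, assuming NumPy and my own helper names, of the quadratic feature map on this slide:

```python
import numpy as np

def phi(x):
    """Quadratic feature map from the slide: [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])   # boundary: -1 + x1^2 + x2^2 = 0

def predict(x):
    return 1 if w @ phi(x) >= 0 else -1

print(predict([0.0, 0.0]))   # inside the unit circle  -> -1
print(predict([2.0, 0.0]))   # outside the unit circle -> +1
```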

  10. Cost function for linear classification
     β€’ Finding a linear classifier can be formulated as an optimization problem:
       β€’ Select how to measure the prediction loss; based on the training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, a cost function $J(\mathbf{w})$ is defined
       β€’ Solve the resulting optimization problem to find the parameters: the optimal $\hat{f}(\mathbf{x}) = f(\mathbf{x}; \hat{\mathbf{w}})$ where $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w})$
     β€’ Criteria or cost functions for classification: we will investigate several cost functions for the classification problem

  11. SSE cost function for classification ($K = 2$)
     β€’ $J(\mathbf{w}) = \sum_{i=1}^{N} \left(\mathbf{w}^T \mathbf{x}^{(i)} - y^{(i)}\right)^2$ with $y \in \{-1, +1\}$
     β€’ The SSE cost function is not suitable for classification:
       β€’ The least-squares loss penalizes "too correct" predictions (points that lie a long way on the correct side of the decision boundary)
       β€’ The least-squares loss also lacks robustness to noise

  12. SSE cost function for classification ($K = 2$)
     [Figure: the squared error $(\mathbf{w}^T \mathbf{x} - y)^2$ plotted against $\mathbf{w}^T \mathbf{x}$ for $y = 1$ and $y = -1$; correct predictions far from the boundary are still penalized by SSE. [Bishop]]

  13. SSE cost function for classification ($K = 2$)
     β€’ Is it more suitable if we set $f(\mathbf{x}; \mathbf{w}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x})$?
       $J(\mathbf{w}) = \sum_{i=1}^{N} \left(\mathrm{sign}(\mathbf{w}^T \mathbf{x}^{(i)}) - y^{(i)}\right)^2$, where $\mathrm{sign}(a) = -1$ if $a < 0$ and $1$ if $a \ge 0$
     β€’ $J(\mathbf{w})$ is a piecewise-constant function, proportional to the number of misclassifications (the training error incurred in classifying the training samples)

  14. SSE cost function ($K = 2$)
     β€’ Is it more suitable if we set $f(\mathbf{x}; \mathbf{w}) = \tau(\mathbf{w}^T \mathbf{x})$?
       $J(\mathbf{w}) = \sum_{i=1}^{N} \left(\tau(\mathbf{w}^T \mathbf{x}^{(i)}) - y^{(i)}\right)^2$, where $\tau(a) = \dfrac{1 - e^{-a}}{1 + e^{-a}}$
     β€’ We will see later in this lecture that the cost function of the logistic regression method is more suitable than this cost function for the classification problem
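To make the comparison across slides 11-14 concrete, here is a small sketch (my own, not from the slides) that evaluates the three squared-error variants on labels in $\{-1, +1\}$; the toy data are made up:

```python
import numpy as np

def sse_raw(w, X, y):
    """Plain SSE on the raw score w^T x (penalizes 'too correct' predictions)."""
    return np.sum((X @ w - y) ** 2)

def sse_sign(w, X, y):
    """SSE on sign(w^T x): piecewise constant, proportional to the number of misclassifications."""
    s = np.where(X @ w >= 0, 1, -1)
    return np.sum((s - y) ** 2)

def sse_tau(w, X, y):
    """SSE on tau(w^T x) with tau(a) = (1 - e^{-a}) / (1 + e^{-a})."""
    a = X @ w
    tau = (1 - np.exp(-a)) / (1 + np.exp(-a))
    return np.sum((tau - y) ** 2)

# Tiny illustration: one very confident correct prediction, one mildly wrong one.
X = np.array([[10.0], [-0.5]])
y = np.array([1, -1])
w = np.array([1.0])
print(sse_raw(w, X, y), sse_sign(w, X, y), sse_tau(w, X, y))
```

The very confident, correct prediction drives the plain SSE up, which is exactly the "too correct" penalty mentioned on slide 11, while the sign-based and squashed variants leave it essentially unpenalized.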

  15. Perceptron algorithm
     β€’ Linear classifier, two-class: $y \in \{-1, 1\}$ ($y = -1$ for $\mathcal{C}_2$, $y = 1$ for $\mathcal{C}_1$)
     β€’ Goal: $\forall i,\ \mathbf{x}^{(i)} \in \mathcal{C}_1 \Rightarrow \mathbf{w}^T \mathbf{x}^{(i)} > 0$ and $\forall i,\ \mathbf{x}^{(i)} \in \mathcal{C}_2 \Rightarrow \mathbf{w}^T \mathbf{x}^{(i)} < 0$
     β€’ $f(\mathbf{x}; \mathbf{w}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x})$

  16. Perceptron criterion
     β€’ $J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{w}^T \mathbf{x}^{(i)} y^{(i)}$, where $\mathcal{M}$ is the subset of training data that are misclassified
     β€’ Many solutions? Which solution among them?

  17. Cost function
     [Figure: the number of misclassifications as a cost function vs. the Perceptron cost function $J_P(\mathbf{w})$, each plotted over $(w_0, w_1)$; there may be many solutions under either cost function. [Duda, Hart & Stork, 2002]]

  18. Batch Perceptron
     β€’ Gradient descent on the optimization problem: $\mathbf{w}^{t+1} = \mathbf{w}^{t} - \eta\, \nabla_{\mathbf{w}} J_P(\mathbf{w}^{t})$, with $\nabla_{\mathbf{w}} J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
     β€’ The batch Perceptron converges in a finite number of steps for linearly separable data:
       Initialize $\mathbf{w}$
       Repeat
         $\mathbf{w} \leftarrow \mathbf{w} + \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
       Until $\left\lVert \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)} \right\rVert < \theta$
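A runnable sketch of the batch Perceptron update above; the stopping threshold, the treatment of points exactly on the boundary, and the toy data are my assumptions:

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, theta=1e-6, max_iter=1000):
    """Batch Perceptron: w <- w + eta * sum over misclassified samples of x_i * y_i."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        scores = X @ w
        M = (np.sign(scores) != y) | (scores == 0)   # misclassified (boundary counts as an error)
        step = eta * (X[M] * y[M, None]).sum(axis=0)
        w += step
        if np.linalg.norm(step) < theta:             # no significant update left
            break
    return w

# Toy separable data with a bias feature prepended (x0 = 1).
X = np.array([[1.0, 2.0, 2.0], [1.0, 3.0, 3.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = batch_perceptron(X, y)
print(w, np.sign(X @ w))   # all signs match y
```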

  19. Stochastic gradient descent for the Perceptron
     β€’ Single-sample perceptron: if $\mathbf{x}^{(i)}$ is misclassified, $\mathbf{w}^{t+1} = \mathbf{w}^{t} + \eta\, \mathbf{x}^{(i)} y^{(i)}$
     β€’ Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps
     β€’ Fixed-increment single-sample Perceptron ($\eta$ can be set to 1 and the proof still works):
       Initialize $\mathbf{w}$, $t \leftarrow 0$
       Repeat
         $t \leftarrow t + 1$; $i \leftarrow t \bmod N$
         if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
       Until all patterns are properly classified
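A corresponding sketch of the fixed-increment single-sample Perceptron ($\eta = 1$); the cap on the number of passes is my addition so the loop terminates even on non-separable data (see slide 25):

```python
import numpy as np

def single_sample_perceptron(X, y, max_passes=100):
    """Fixed-increment single-sample Perceptron (eta = 1)."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_passes):                 # cap passes in case data are not separable
        errors = 0
        for i in range(N):
            if y[i] * (w @ X[i]) <= 0:          # misclassified (or on the boundary)
                w += y[i] * X[i]
                errors += 1
        if errors == 0:                          # all patterns properly classified
            break
    return w

# Usage on the same toy data as above (bias feature x0 = 1 prepended):
# w = single_sample_perceptron(np.c_[np.ones(4), [[2, 2], [3, 3], [-1, -2], [-2, -1]]],
#                              np.array([1, 1, -1, -1]))
```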

  20. Example [figure]

  21. Perceptron: Example
     β€’ Change $\mathbf{w}$ in a direction that corrects the error [Bishop]

  22. Online vs. Batch Learning
     β€’ Batch learning: learn from all the examples at once
     β€’ Online learning: gradually learn as each example is received
       β€’ Examples: email classification; recommendation systems (recommending movies, predicting whether a user will be interested in a new news article)

  23. Perceptron convergence
     β€’ Assume there exists $\mathbf{w}^*$ with $\lVert \mathbf{w}^* \rVert = 1$ and some $\gamma > 0$ such that $y^{(n)} \mathbf{w}^{*T} \mathbf{x}^{(n)} \ge \gamma$ for all $n = 1, \dots, N$. Also assume that $\lVert \mathbf{x}^{(n)} \rVert \le R$ for all $n = 1, \dots, N$.
     β€’ Then the Perceptron makes at most $R^2 / \gamma^2$ errors.
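For reference, a standard proof sketch of this mistake bound (the usual textbook argument, not necessarily the derivation used in the lecture):

```latex
% Setting: ||w*|| = 1,  y^{(n)} w*^T x^{(n)} >= gamma,  ||x^{(n)}|| <= R,  w_0 = 0.
% Let w_k denote the weights after the k-th mistake, made on an example (x, y)
% for which y w_{k-1}^T x <= 0.
\begin{align*}
\mathbf{w}^{*T}\mathbf{w}_k &= \mathbf{w}^{*T}(\mathbf{w}_{k-1} + y\,\mathbf{x})
  \ge \mathbf{w}^{*T}\mathbf{w}_{k-1} + \gamma
  \;\Rightarrow\; \mathbf{w}^{*T}\mathbf{w}_k \ge k\gamma \\
\lVert\mathbf{w}_k\rVert^2 &= \lVert\mathbf{w}_{k-1}\rVert^2 + 2y\,\mathbf{w}_{k-1}^{T}\mathbf{x} + \lVert\mathbf{x}\rVert^2
  \le \lVert\mathbf{w}_{k-1}\rVert^2 + R^2
  \;\Rightarrow\; \lVert\mathbf{w}_k\rVert^2 \le kR^2 \\
k\gamma &\le \mathbf{w}^{*T}\mathbf{w}_k \le \lVert\mathbf{w}^*\rVert\,\lVert\mathbf{w}_k\rVert \le \sqrt{k}\,R
  \;\Rightarrow\; k \le R^2/\gamma^2
\end{align*}
```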

  24. Perceptron convergence (more general case)
     β€’ A similar bound can be shown without assuming $\lVert \mathbf{w}^* \rVert = 1$, in terms of $\max_{(\mathbf{x}, y) \in D} \lVert \mathbf{x} \rVert^2$ and the margin $\min_{(\mathbf{x}, y) \in D} y\, \mathbf{w}^{*T} \mathbf{x}$.

  25. Convergence of the Perceptron [Duda, Hart & Stork, 2002]
     β€’ For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge

  26. Pocket algorithm
     β€’ For data that are not linearly separable due to noise: keep "in the pocket" the best $\mathbf{w}$ encountered so far
       Initialize $\mathbf{w}$
       for $t = 1, \dots, T$
         $i \leftarrow t \bmod N$
         if $\mathbf{x}^{(i)}$ is misclassified then
           $\mathbf{w}_{\text{new}} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
           if $E_{\text{train}}(\mathbf{w}_{\text{new}}) < E_{\text{train}}(\mathbf{w})$ then $\mathbf{w} = \mathbf{w}_{\text{new}}$
       end
     β€’ $E_{\text{train}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\left[\mathrm{sign}(\mathbf{w}^T \mathbf{x}^{(n)}) \ne y^{(n)}\right]$
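A direct transcription of this pseudocode into Python (a sketch; the zero initialization and the $\le 0$ misclassification test are my assumptions):

```python
import numpy as np

def train_error(w, X, y):
    """Fraction of training samples misclassified by sign(w^T x)."""
    return np.mean(np.where(X @ w >= 0, 1, -1) != y)

def pocket(X, y, T=1000):
    """Pocket algorithm as sketched on the slide: accept a perceptron update
    only if it lowers the training error of the current weights."""
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = t % N
        if y[i] * (w @ X[i]) <= 0:                      # x^(i) is misclassified
            w_new = w + y[i] * X[i]
            if train_error(w_new, X, y) < train_error(w, X, y):
                w = w_new                               # keep the better weights
    return w
```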

  27. Linear Discriminant Analysis (LDA)
     β€’ Fisher's Linear Discriminant Analysis:
       β€’ Dimensionality reduction: finds linear combinations of features with large ratios of between-group scatter to within-group scatter (used as new discriminant variables)
       β€’ Classification: predicts the class of an observation $\mathbf{x}$ by first projecting it onto the space of discriminant variables and then classifying it in that space

  28. Good projection for classification
     β€’ What is a good criterion? Separating the different classes in the projected space

  29. Good projection for classification
     β€’ What is a good criterion? Separating the different classes in the projected space

  30. Good projection for classification
     β€’ What is a good criterion? Separating the different classes in the projected space
     [Figure shows a candidate projection direction $\mathbf{w}$]

  31. LDA problem
     β€’ Problem definition ($K = 2$ classes):
       β€’ Training samples $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, with $N_1$ samples from the first class ($\mathcal{C}_1$) and $N_2$ samples from the second class ($\mathcal{C}_2$)
       β€’ Goal: find the best direction $\mathbf{w}$, one that we hope will enable accurate classification
     β€’ The projection of a sample $\mathbf{x}$ onto a line in direction $\mathbf{w}$ is $\mathbf{w}^T \mathbf{x}$
     β€’ What is a good measure of the separation between the projected points of the different classes?

  32. Measure of separation in the projected direction
     β€’ Is the direction of the line joining the class means a good candidate for $\mathbf{w}$? [Bishop]
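As a concrete baseline for this question, a brief sketch (my own, not from the slides) of projecting two-class data onto the direction joining the class means; Fisher's criterion, per slide 27, refines this idea by also accounting for the within-class scatter:

```python
import numpy as np

def mean_difference_direction(X, y):
    """Candidate projection direction: the (normalized) difference of the two class means."""
    m1 = X[y == 1].mean(axis=0)
    m2 = X[y == -1].mean(axis=0)
    w = m1 - m2
    return w / np.linalg.norm(w)

def project(X, w):
    """1-D projected coordinates w^T x for all samples."""
    return X @ w
```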
