

  1. Linear Classifiers 1

  2. Outline  Framework  “Exact”  Minimize Mistakes (Perceptron Training)  Matrix inversion  “Logistic Regression” Model  Max Likelihood Estimation (MLE) of P(y | x)  Gradient descent (MSE; MLE)  “Linear Discriminant Analysis”  Max Likelihood Estimation (MLE) of P(y, x)  Direct Computation

  3. Diagnosing Butterfly-itis  Hmmm… perhaps Butterfly-itis??

  4. Classifier: Decision Boundaries  Classifier: partitions input space X into “decision regions”  [scatter plot: + and – examples plotted by #wings vs. #antennae, split into decision regions]  Linear threshold unit has a linear decision boundary  Defn: a set of points that can be separated by a linear decision boundary is “linearly separable”

  5. Linear Separators  Draw a “separating line”  [scatter plot: + and – examples by #wings vs. #antennae, with a query point “?” and a separating line at #antennae = 2]  If #antennae ≤ 2, then butterfly-itis  So “?” is not butterfly-itis.

  6. Can be “angled”…  [scatter plot: + and – examples by #wings vs. #antennae, with a slanted separating line 2.3 × #w + 7.5 × #a + 1.2 = 0 and a query point “?”]  If 2.3 × #wings + 7.5 × #antennae + 1.2 > 0, then butterfly-itis

  7. Linear Separators, in General  Given data (many features):

      F1 (Temp.)   F2 (Press.)   …   Fn (Color)   Class (diseaseX?)
      35           95            …   Pale         No
      22           80            …   Clear        Yes
      :            :                 :            :
      10           50            …   Pale         No

  Find “weights” {w1, w2, …, wn, w0} such that w1 × F1 + … + wn × Fn + w0 > 0 means Class = Yes

  8. Linear Separator  [diagram: inputs F1 = 35, F2 = 95, …, Fn = 3 feed weights w1, w2, …, wn into the sum Σi wi × Fi, which is thresholded to output Yes / No]  Just view F0 as fixed at 1, so w0 becomes the constant (threshold) weight …

  9. Linear Separator  [diagram: the same unit with weights 2.3, 7.5, …, 21 on inputs 35, 95, …, 3, producing output 46.8 → Yes]  Performance: given {wi} and the feature values of an instance, compute the response  Learning: given labeled data, find the “correct” {wi}  Linear Threshold Unit … “Perceptron”
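The “performance” step is just a weighted sum followed by a threshold. A minimal NumPy sketch, using made-up weights and feature values rather than the numbers in the slide's diagram:

```python
import numpy as np

def linear_threshold_unit(w, x):
    """Perceptron 'performance' step: compute sum_i w_i * F_i, then threshold at 0."""
    score = np.dot(w, x)
    return ("Yes" if score > 0 else "No"), score

# Hypothetical weights and one instance's feature values (not the slide's data)
w = np.array([2.3, 7.5, 1.2])
x = np.array([3.0, 1.0, -4.0])
print(linear_threshold_unit(w, x))
```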

  10. Linear Separators – Facts  GOOD NEWS: if the data is linearly separable, then a FAST ALGORITHM finds a correct {wi}!  But… [plot: a + / – configuration]

  11. Linear Separators – Facts  GOOD NEWS: if the data is linearly separable, then a FAST ALGORITHM finds a correct {wi}!  But… some “data sets” are NOT linearly separable! [plot: a + / – configuration that no single line can separate]  Stay tuned!

  12. Geometric View  Consider 3 training examples: ([1.0, 1.0]; 1), ([0.5, 3.0]; 1), ([2.0, 2.0]; 0)  Want a classifier that looks like. . .

  13. Linear Equation is a Hyperplane  The equation w · x = ∑i wi · xi defines a plane  y(x) = 1 if w · x > 0, 0 otherwise

  14. Linear Threshold Unit: “Perceptron”  Squashing function sgn: ℝ → {-1, +1}, here used as the Heaviside step:  sgn(r) = 1 if r > 0, 0 otherwise (“heaviside”)  Actually the test is w · x > b, but. . . create an extra input x0 fixed at 1; the corresponding w0 corresponds to -b
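A small sketch of that bias trick, with hypothetical numbers, showing that testing w · x > b is equivalent to prepending x0 = 1 and testing the augmented w · [1, x] > 0 with w0 = -b:

```python
import numpy as np

def predict_with_threshold(w, x, b):
    """Original form: fire iff w·x > b."""
    return 1 if np.dot(w, x) > b else 0

def predict_homogeneous(w_aug, x):
    """Bias-trick form: prepend x0 = 1 and fold -b into w0, so the test is w_aug·[1, x] > 0."""
    x_aug = np.concatenate(([1.0], x))
    return 1 if np.dot(w_aug, x_aug) > 0 else 0

# Hypothetical numbers, just to show the two forms agree
w, b = np.array([2.0, -1.0]), 0.5
w_aug = np.concatenate(([-b], w))     # w0 = -b
x = np.array([0.7, 0.3])
assert predict_with_threshold(w, x, b) == predict_homogeneous(w_aug, x)
```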

  15. Learning Perceptrons  Can represent linearly separable surfaces . . . any hyper-plane between two half-spaces…  Remarkable learning algorithm [Rosenblatt 1960]: if a function f can be represented by a perceptron, then ∃ a learning algorithm guaranteed to quickly converge to f! ⇒ enormous popularity, early / mid 60's  But some simple functions cannot be represented (Boolean XOR) [Minsky/Papert 1969]  Killed the field temporarily!

  16. Perceptron Learning  Hypothesis space is. . .  Fixed size: ∃ O(2^(n²)) distinct perceptrons over n boolean features  Deterministic  Continuous parameters  Learning algorithm:  Various: local search, direct computation, . . .  Eager  Online / Batch

  17. Task  Input: labeled data, transformed to the homogeneous form [1, x(i)]  Output: w ∈ ℝ^(r+1)  Goal: want w s.t. ∀i sgn(w · [1, x(i)]) = y(i)  . . . minimize mistakes wrt the data . . .

  18. Error Functions  Given data { [x(i), y(i)] } i=1..m, optimize...  1. Classification error: errClass(w) = (1/m) ∑i=1..m I[ y(i) ≠ ow(x(i)) ]  [Perceptron Training; Matrix Inversion]  2. Mean-squared error: errMSE(w) = (1/m) ∑i=1..m ½ [ y(i) − ow(x(i)) ]²  [Matrix Inversion; Gradient Descent]  3. (Log) Conditional Probability: LCL(w) = (1/m) ∑i=1..m log Pw( y(i) | x(i) )  [MSE Gradient Descent; LCL Gradient Descent]  4. (Log) Joint Probability: LL(w) = (1/m) ∑i=1..m log Pw( y(i), x(i) )  [Direct Computation]
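A minimal NumPy sketch of the first two objectives (0/1 classification error and MSE), evaluated on the three geometric-view examples from slide 12 in homogeneous form with an arbitrary weight vector; the likelihood-based objectives need a probability model Pw and are not sketched here:

```python
import numpy as np

def o_w(w, X):
    """Perceptron output o_w(x) = 1 if w·x > 0 else 0, for every row of X."""
    return (X @ w > 0).astype(float)

def err_class(w, X, y):
    """(1/m) * number of misclassified examples."""
    return np.mean(y != o_w(w, X))

def err_mse(w, X, y):
    """(1/m) * sum of 0.5 * (y - o_w(x))^2."""
    return np.mean(0.5 * (y - o_w(w, X)) ** 2)

# Slide-12 examples, first column is the constant x0 = 1; w is an arbitrary guess
X = np.array([[1, 1.0, 1.0], [1, 0.5, 3.0], [1, 2.0, 2.0]])
y = np.array([1.0, 1.0, 0.0])
w = np.array([0.0, 1.0, -1.0])
print(err_class(w, X, y), err_mse(w, X, y))   # ≈ 0.667 and ≈ 0.333 for this w
```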

  19. #1: Optimal Classification Error  For each labeled instance [x, y]: Err = y – ow(x), where y = f(x) is the target value and ow(x) = sgn(w · x) is the perceptron output  Idea: move the weights in the appropriate direction, to push Err → 0  If Err > 0 (error on a POSITIVE example): need to increase sgn(w · x) ⇒ need to increase w · x  Input j contributes wj · xj to w · x:  if xj > 0, increasing wj will increase w · x  if xj < 0, decreasing wj will increase w · x  ⇒ wj ← wj + xj  Likewise, if Err < 0 (error on a NEGATIVE example) ⇒ wj ← wj – xj

  20. #1a: Mistake Bound Perceptron Algorithm

  Initialize w = 0
  Do until bored:
    Predict “+” iff w · x > 0, else “–”
    Mistake on positive: w ← w + x
    Mistake on negative: w ← w – x

  Example trace:

      Weights    Instance   Action
      [0 0 0]    #1         +x
      [1 0 0]    #2         -x
      [0 -1 0]   #3         +x
      [1 0 1]    #1         OK
      [1 0 1]    #2         -x
      [0 -1 1]   #3         +x
      [1 0 2]    #1         OK
      [1 0 2]    #2         -x
      [0 -1 2]   #3         OK
      [1 -1 2]   #1         +x
      [1 -1 2]   #2         OK
      [1 -1 2]   #3         OK
      [1 -1 2]   #1         OK
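A runnable sketch of the same loop. The slide's trace does not list the actual feature vectors, so the three instances below are hypothetical; labels are in {+1, -1}:

```python
import numpy as np

def mistake_bound_perceptron(X, y, epochs=10):
    """Cycle through the data; on a mistake, add x (positive example) or subtract x (negative)."""
    w = np.zeros(X.shape[1])                     # Initialize w = 0
    for _ in range(epochs):                      # "Do until bored"
        mistakes = 0
        for x, label in zip(X, y):
            predict_pos = np.dot(w, x) > 0       # Predict "+" iff w·x > 0
            if label == +1 and not predict_pos:
                w += x                           # Mistake on positive: w <- w + x
                mistakes += 1
            elif label == -1 and predict_pos:
                w -= x                           # Mistake on negative: w <- w - x
                mistakes += 1
        if mistakes == 0:                        # clean pass: converged (data was separable)
            break
    return w

# Hypothetical instances (the slide's trace does not give the actual x vectors)
X = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
y = np.array([+1, -1, +1])
print(mistake_bound_perceptron(X, y))
```

With this particular (made-up) choice of instances the run happens to end at w = [1, -1, 2], like the trace above.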

  21. Mistake Bound Theorem  Theorem [Rosenblatt 1960]: if the data is consistent with some linear threshold w, then the number of mistakes is ≤ (1/∆)², where ∆ measures the “wiggle room” available: if |x| = 1, then ∆ is the max, over all consistent planes, of the minimum distance of an example to that plane  w is ⊥ to the separator, as w · x = 0 at the boundary  So |w · x| is the projection of x onto w, PERPENDICULAR to the boundary line … i.e., is the distance from x to that line (once normalized)
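A small sketch of what ∆ and the bound look like numerically, on made-up, origin-separable data (negatives sign-flipped and every x normalized to |x| = 1, as the theorem assumes). The plane used here is just one consistent choice, so the resulting bound can be looser than (1/∆)² with the best plane:

```python
import numpy as np

# Made-up data: third row is a negated negative example, so a consistent
# separator w* satisfies w*·x > 0 for every row.
X = np.array([[1.0, 1.0], [2.0, 0.5], [1.0, 2.0]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)   # enforce |x| = 1

w_star = np.array([1.0, 1.0]) / np.sqrt(2.0)       # one consistent unit-length separator
margin = np.min(X @ w_star)                        # this plane's min w*·x; the theorem's
                                                   # ∆ uses the best plane, so ∆ >= margin
print(margin, (1.0 / margin) ** 2)                 # ≈ 0.86 and ≈ 1.36 mistakes at most
```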

  22. Proof of Convergence  x is wrong wrt w iff w · x < 0  Let w* be the unit vector representing the target plane, and ∆ = min_x { w* · x }  Let w be the hypothesis plane  Consider w = Σ_{x : x · w < 0} x  On each mistake, add x to w

  23. Proof (con't)  If w makes a mistake on x…  ∆ = min_x { w* · x },  w = Σ_{x : x · w < 0} x
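The rest of the derivation is not in the transcript; a standard version of the argument in the slide's notation, as a sketch (assuming w starts at 0, examples satisfy |x| ≤ 1, and each mistake triggers w ← w + x):

```latex
\begin{align*}
\mathbf{w}^{*}\cdot(\mathbf{w}+\mathbf{x})
   &\ge \mathbf{w}^{*}\cdot\mathbf{w} + \Delta
   && \text{each mistake grows } \mathbf{w}^{*}\cdot\mathbf{w} \text{ by at least } \Delta \\
\|\mathbf{w}+\mathbf{x}\|^{2}
   &= \|\mathbf{w}\|^{2} + 2\,\mathbf{w}\cdot\mathbf{x} + \|\mathbf{x}\|^{2}
      \le \|\mathbf{w}\|^{2} + 1
   && \text{since } \mathbf{w}\cdot\mathbf{x}<0 \text{ on a mistake and } \|\mathbf{x}\|\le 1 \\
M\,\Delta \le \mathbf{w}^{*}\cdot\mathbf{w}
   &\le \|\mathbf{w}^{*}\|\,\|\mathbf{w}\| \le \sqrt{M}
   && \text{after } M \text{ mistakes (Cauchy--Schwarz)}
\end{align*}
so $M \le (1/\Delta)^{2}$, matching the bound on slide 21.
```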

  24. #1b: Perceptron Training Rule  For each labeled instance [x, y]:  Err([x, y]) = y – ow(x) ∈ {-1, 0, +1}  If Err([x, y]) = 0: correct! … do nothing! ∆w = 0 ≡ Err([x, y]) · x  If Err([x, y]) = +1: mistake on positive! Increment by +x: ∆w = +x ≡ Err([x, y]) · x  If Err([x, y]) = -1: mistake on negative! Increment by -x: ∆w = -x ≡ Err([x, y]) · x  In all cases: ∆w(i) = Err([x(i), y(i)]) · x(i) = [y(i) – ow(x(i))] · x(i)  Batch mode: do ALL updates at once! ∆wj = ∑i ∆wj(i) = ∑i xj(i) (y(i) – ow(x(i))), then wj += η ∆wj (a runnable sketch follows the next slide)

  25.  0. Fix the current w (a new w is produced in step 2)
  1. Set ∆w = 0; then for each row i, compute:
     a. E(i) = y(i) – ow(x(i))
     b. ∆w += E(i) x(i)   [ … ∆wj += E(i) xj(i) for each feature j … ]
  2. Increment w += η ∆w
  [diagram: the feature entries xj(i), the error column E(i), and the accumulated ∆wj]
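A sketch of this batch loop in NumPy, run on the three geometric-view examples from slide 12 in homogeneous form (x0 = 1); the step size η = 1 and the iteration cap are arbitrary choices, not prescribed by the slides:

```python
import numpy as np

def o_w(w, X):
    """Thresholded perceptron output for each row of X."""
    return (X @ w > 0).astype(float)

def batch_perceptron_rule(X, y, eta=1.0, n_iters=100):
    """Batch perceptron training rule: accumulate all updates, then apply them at once."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        delta_w = np.zeros_like(w)        # 1. ∆w = 0
        errors = y - o_w(w, X)            #    a. E(i) = y(i) - o_w(x(i)) for every row
        delta_w += errors @ X             #    b. ∆w_j += Σ_i E(i) x_j(i)
        if not errors.any():
            break                         # no mistakes: done (the data here is separable)
        w += eta * delta_w                # 2. w += η ∆w
    return w

# Slide-12 examples in homogeneous form, labels in {0, 1}
X = np.array([[1, 1.0, 1.0], [1, 0.5, 3.0], [1, 2.0, 2.0]])
y = np.array([1.0, 1.0, 0.0])
print(batch_perceptron_rule(X, y))
```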

  26. Correctness  Rule is intuitive: climbs in the correct direction. . .  Theorem: converges to the correct answer, if . . .  the training data is linearly separable  η is sufficiently small  Proof: weight space has EXACTLY 1 minimum! (no non-global minima) ⇒ with enough examples, finds the correct function!  Explains early popularity  If η is too large, it may overshoot; if η is too small, it takes too long  So often η = η(k) … which decays with the number of iterations, k
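The slides do not commit to a particular decay schedule; one common illustrative choice is η(k) = η0 / (1 + k):

```python
def eta_schedule(k, eta0=1.0):
    """A decaying step size η(k) = η0 / (1 + k); one common choice, not the slides' prescription."""
    return eta0 / (1.0 + k)

# Usage sketch inside a training loop:
# for k in range(n_iters):
#     ...compute delta_w as on slide 25...
#     w += eta_schedule(k) * delta_w
```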

  27. #1c: Matrix Version?

  28. Issues  1. Why restrict to only y(i) ∈ {–1, +1}?  If from a discrete set y(i) ∈ {0, 1, …, m}: general (non-binary) classification  If ARBITRARY y(i) ∈ ℝ: regression  2. What if NO w works? (X is singular; overconstrained)  Could try to minimize a residual:  ∑i I[ y(i) ≠ w · x(i) ]  NP-hard!  ||y – Xw||₁ = ∑i |y(i) – w · x(i)|  ||y – Xw||₂² = ∑i (y(i) – w · x(i))²  Easy!

  29. L2 Error vs 0/1-Loss  The “0/1 loss function” is not smooth / differentiable  The MSE error is smooth and differentiable… and is an overbound (upper bound) on the 0/1 loss...

  30. Gradient Descent for Perceptron?  Why not gradient descent for the THRESHOLDed perceptron?  It needs a gradient (derivative), and the thresholded output sgn(w · x) is a step function, not differentiable  Gradient descent is a general approach. Requires: + continuously parameterized hypothesis + error must be differentiable wrt the parameters  But. . . – can be slow (many iterations) – may only find a LOCAL optimum

  31. #1. LMS Version of Classifier  View as regression  Find the “best” linear mapping w from X to Y:  w* = argmin_w ErrLMS(X,Y)(w), where ErrLMS(X,Y)(w) = ∑i (y(i) – w · x(i))²  Threshold: if w · x > 0.5, return 1; else 0  See Chapter 3
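A minimal sketch of this LMS-as-classifier recipe: fit w by least squares (here via np.linalg.lstsq, one way of doing the matrix computation) and then threshold w · x at 0.5; the data are the three geometric-view examples from slide 12 in homogeneous form:

```python
import numpy as np

def fit_lms_classifier(X, y):
    """Fit w by least squares (regression view): w* = argmin_w sum_i (y(i) - w·x(i))^2."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_lms(w, X):
    """Classify by thresholding the regression output at 0.5."""
    return (X @ w > 0.5).astype(int)

# Slide-12 examples in homogeneous form (x0 = 1)
X = np.array([[1, 1.0, 1.0], [1, 0.5, 3.0], [1, 2.0, 2.0]])
y = np.array([1.0, 1.0, 0.0])
w = fit_lms_classifier(X, y)
print(w, predict_lms(w, X))   # recovers the labels [1, 1, 0] on this tiny separable set
```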

  32. Use Linear Regression for Classification?  1. Use regression to find weights w  2. Classify a new instance x as sgn(w · x)  But … regression minimizes the sum of squared errors on the target function … which gives strong influence to outliers  [plots: “Great separation” vs. “Bad separation”, where outliers pull the regression boundary]
