 
Linear Classifiers

Outline
• Framework
• "Exact"
    • Minimize Mistakes (Perceptron Training)
    • Matrix inversion
• "Logistic Regression" Model
    • Max Likelihood Estimation (MLE) of P( y | x )
    • Gradient descent (MSE; MLE)
• "Linear Discriminant Analysis"
    • Max Likelihood Estimation (MLE) of P( y, x )
    • Direct Computation

Diagnosing Butterfly-itis
[Figure] Hmmm… perhaps Butterfly-itis??

Classifier: Decision Boundaries
• Classifier: partitions input space X into "decision regions"
[Figure: + and − examples plotted by #antennae (horizontal axis) and #wings (vertical axis)]
• Linear threshold unit has a linear decision boundary
• Defn: A set of points that can be separated by a linear decision boundary is "linearly separable"

Linear Separators
• Draw "separating line"
[Figure: the same + / − scatter plot (#wings vs. #antennae) with an unlabeled point "?" and a separating line at #antennae = 2]
• If #antennae ≤ 2, then butterfly-itis
• So ? is not butterfly-itis.

Can be "angled"…
[Figure: the same scatter plot with the angled separating line 2.3 × #wings + 7.5 × #antennae + 1.2 = 0]
• If 2.3 × #wings + 7.5 × #antennae + 1.2 > 0, then butterfly-itis

Linear Separators, in General
• Given data (many features), shown as raw values and as numeric features:

    Temp.   Press.   …   Color    diseaseX?
    35      95       …   Pale     No
    22      80       …   Clear    Yes
    :       :        :   :        :
    10      50       …   Pale     No

    F1      F2       …   Fn       Class
    35      95       …   3        No
    22      80       …   -2       Yes
    :       :        :   :        :
    10      50       …   1.9      No

• Find "weights" {w_1, w_2, …, w_n, w_0} such that
  w_1 × F_1 + … + w_n × F_n + w_0 > 0 means Class = Yes

Linear Separator
[Figure: inputs F_1 = 35, F_2 = 95, …, F_n = 3, weighted by w_1, w_2, …, w_n, feed a unit that computes Σ_i w_i × F_i and outputs Yes or No]
• Just view F_0 = 1, so w_0 …

Linear Separator
[Figure: the same unit with inputs 35, 95, …, 3 and weights 2.3, 7.5, …, 21; the figure shows the value 46.8 and the output Yes]
• Performance: given {w_i} and the feature values of an instance, compute the response
• Learning: given labeled data, find "correct" {w_i}
• Linear Threshold Unit … "Perceptron"

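The "performance" step can be sketched in a few lines of Python. This is only an illustration: the function name `linear_threshold` and the example instances are made up, and the second set of weights is hypothetical.

```python
def linear_threshold(weights, w0, features):
    """Return True ("Yes") when w1*F1 + ... + wn*Fn + w0 > 0."""
    total = w0 + sum(w * f for w, f in zip(weights, features))
    return total > 0

# Weights from the "angled" butterfly-itis line; the instance
# (4 wings, 2 antennae) is a made-up example.
butterfly = linear_threshold([2.3, 7.5], 1.2, [4, 2])

# Hypothetical weights chosen so that this instance lands on the "No" side.
other = linear_threshold([1.0, -1.0], -0.5, [0.0, 2.0])

print(butterfly, other)
```

Learning is the harder half: it has to choose the weights so that this test agrees with the labels.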
Linear Separators – Facts
• GOOD NEWS: if the data is linearly separable, then a FAST ALGORITHM finds correct {w_i}! (Stay tuned!)
• But… some "data sets" are NOT linearly separable!
[Figure: four points labeled −, +, −, +]

Geometric View
• Consider 3 training examples: ( [1.0, 1.0]; 1 ), ( [0.5, 3.0]; 1 ), ( [2.0, 2.0]; 0 )
• Want classifier that looks like. . . [Figure]

Linear Equation is Hyperplane
• The equation $w \cdot x = \sum_i w_i x_i = 0$ defines a (hyper)plane
• $y(x) = \begin{cases} 1 & \text{if } w \cdot x > 0 \\ 0 & \text{otherwise} \end{cases}$

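Linear Threshold Unit: "Perceptron"
• Squashing function: sgn: ℜ → {-1, +1}; the 0/1 variant shown here is the "Heaviside" step:
  $\mathrm{sgn}(r) = \begin{cases} 1 & \text{if } r > 0 \\ 0 & \text{otherwise} \end{cases}$
• Actually w · x > b, but. . .
    • Create extra input x_0 fixed at 1
    • The corresponding weight w_0 plays the role of -b
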
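A small sketch of the extra-input trick, assuming integer-valued weights so that both tests agree exactly; the function names and the numbers are made up for illustration.

```python
def fires_with_threshold(w, x, b):
    """Original form: fire when w . x > b."""
    return sum(wj * xj for wj, xj in zip(w, x)) > b

def fires_with_bias_input(w, x, b):
    """Same test after adding x0 = 1 with weight w0 = -b."""
    w_aug = [-b] + list(w)
    x_aug = [1] + list(x)
    return sum(wj * xj for wj, xj in zip(w_aug, x_aug)) > 0

w, b = [2, -1], 3                      # hypothetical weights and threshold
points = [(2, 0), (1, 0), (2, 1)]
agree = all(fires_with_threshold(w, p, b) == fires_with_bias_input(w, p, b)
            for p in points)
print(agree)
```

With the threshold folded into w_0, a learner only has to search over one weight vector.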
Learning Perceptrons
• Can represent any linearly separable surface . . . any hyper-plane between two half-spaces…
• Remarkable learning algorithm: [Rosenblatt 1960]
  If function f can be represented by a perceptron, then ∃ learning alg guaranteed to quickly converge to f!
  ⇒ enormous popularity, early / mid 60's
• But some simple fns cannot be represented (Boolean XOR) [Minsky/Papert 1969]
• Killed the field temporarily!

Perceptron Learning
• Hypothesis space is. . .
    • Fixed Size: ∃ $O(2^{n^2})$ distinct perceptrons over n boolean features
    • Deterministic
    • Continuous Parameters
• Learning algorithm:
    • Various: Local search, Direct computation, . . .
    • Eager
    • Online / Batch

Task
• Input: labeled data [Figure: data table transformed to matrix form]
• Output: $w \in \Re^{r+1}$
• Goal: want w s.t. $\forall i:\ \mathrm{sgn}(w \cdot [1, x^{(i)}]) = y^{(i)}$
• . . . minimize mistakes wrt data . . .

Error Function
Given data $\{ [x^{(i)}, y^{(i)}] \}_{i=1..m}$, optimize...
1. Classification error (Perceptron Training; Matrix Inversion)
   $\mathrm{err}_{Class}(w) = \frac{1}{m} \sum_{i=1}^{m} I[\, y^{(i)} \neq o_w(x^{(i)}) \,]$
2. Mean-squared error (Matrix Inversion; Gradient Descent)
   $\mathrm{err}_{MSE}(w) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} [\, y^{(i)} - o_w(x^{(i)}) \,]^2$
3. (Log) Conditional Probability (MSE Gradient Descent; LCL Gradient Descent)
   $\mathrm{LCL}(w) = \frac{1}{m} \sum_{i=1}^{m} \log P_w(y^{(i)} \mid x^{(i)})$
4. (Log) Joint Probability (Direct Computation)
   $\mathrm{LL}(w) = \frac{1}{m} \sum_{i=1}^{m} \log P_w(y^{(i)}, x^{(i)})$

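As a rough illustration of the first two criteria, the sketch below evaluates them on a toy data set. It assumes 0/1 labels, a thresholded output for the classification error, and the unthresholded linear output for the mean-squared error; the data, the weights, and all function names are made up.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def thresholded_output(w, x):
    # x0 = 1 is prepended so that w[0] acts as the bias weight
    return 1 if dot(w, [1.0] + list(x)) > 0 else 0

def linear_output(w, x):
    return dot(w, [1.0] + list(x))

def err_class(w, X, y):
    """Fraction of instances whose thresholded output differs from the label."""
    wrong = sum(1 for xi, yi in zip(X, y) if yi != thresholded_output(w, xi))
    return wrong / len(X)

def err_mse(w, X, y):
    """Mean of one half of the squared residual of the linear output."""
    total = sum(0.5 * (yi - linear_output(w, xi)) ** 2 for xi, yi in zip(X, y))
    return total / len(X)

X = [(0, 0), (0, 1), (1, 0), (1, 1)]   # toy data: logical AND
y = [0, 0, 0, 1]
w = [-1.0, 1.0, 1.0]                   # hypothetical weights [w0, w1, w2]

print(err_class(w, X, y), err_mse(w, X, y))
```

Here the weights classify every instance correctly, yet the squared error is still positive: the two criteria reward different things.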
#1: Optimal Classification Error
• For each labeled instance [x, y]: Err = y – o_w(x)
    • y = f(x) is target value
    • o_w(x) = sgn(w · x) is perceptron output
• Idea: Move weights in appropriate direction, to push Err → 0
• If Err > 0 (error on POSITIVE example)
    • need to increase sgn(w · x) ⇒ need to increase w · x
    • Input j contributes w_j · x_j to w · x
    • if x_j > 0, increasing w_j will increase w · x
    • if x_j < 0, decreasing w_j will increase w · x
    ⇒ w_j ← w_j + x_j
• If Err < 0 (error on NEGATIVE example)
    ⇒ w_j ← w_j – x_j

#1a: Mistake Bound Perceptron Alg
Initialize w = 0
Do until bored
    Predict "+" iff w · x > 0, else "–"
    Mistake on positive: w ← w + x
    Mistake on negative: w ← w – x

    Weights     Instance   Action
    [0 0 0]     #1         +x
    [1 0 0]     #2         -x
    [0 -1 0]    #3         +x
    [1 0 1]     #1         OK
    [1 0 1]     #2         -x
    [0 -1 1]    #3         +x
    [1 0 2]     #1         OK
    [1 0 2]     #2         -x
    [0 -1 2]    #3         OK
    [0 -1 2]    #1         +x
    [1 -1 2]    #2         OK
    [1 -1 2]    #3         OK
    [1 -1 2]    #1         OK

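The algorithm above can be sketched in Python as follows. The three instances are not given on the slide; the vectors used here are inferred from the weight changes in the trace (#1 = [1 0 0] positive, #2 = [1 1 0] negative, #3 = [1 1 1] positive) and should be read as an assumption. "Until bored" is replaced by "until a full pass makes no mistake".

```python
def perceptron(examples, max_passes=100):
    """Mistake-driven perceptron: add x on a missed positive, subtract on a missed negative."""
    w = [0] * len(examples[0][0])
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, is_positive in examples:
            predicted_positive = sum(wj * xj for wj, xj in zip(w, x)) > 0
            if predicted_positive == is_positive:
                continue                      # "OK": leave w unchanged
            sign = 1 if is_positive else -1
            w = [wj + sign * xj for wj, xj in zip(w, x)]
            mistakes += 1
            clean_pass = False
        if clean_pass:
            break
    return w, mistakes

# Instances inferred from the trace (an assumption, not given explicitly).
examples = [([1, 0, 0], True), ([1, 1, 0], False), ([1, 1, 1], True)]
w, mistakes = perceptron(examples)
print(w, mistakes)
```

On these instances the sketch ends with the same final weights as the trace, [1 -1 2].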
Mistake Bound Theorem
Theorem: [Rosenblatt 1960] If data is consistent w/ some linear threshold w, then number of mistakes is ≤ $(1/\Delta)^2$, where
• ∆ measures "wiggle room" available: if |x| = 1, then ∆ is the max, over all consistent planes, of the minimum distance of an example to that plane
• w is ⊥ to separator, as w · x = 0 at boundary
• So |w · x| is the length of the projection of x onto w, which is PERPENDICULAR to the boundary line … i.e., it is the distance from x to that line (once w is normalized)

Proof of Convergence
• x is wrong wrt w iff w · x < 0
• Let w* be the unit vector representing the target plane: $\Delta = \min_x \{ w^* \cdot x \}$
• Let w be the hypothesis plane
• Consider: $w = \sum_{\{x \,\mid\, x \cdot w < 0\}} x$
• On each mistake, add x to w

Proof (con't)
• If w makes a mistake…
• $\Delta = \min_x \{ w^* \cdot x \}$
• $w = \sum_{\{x \,\mid\, x \cdot w < 0\}} x$

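The remaining steps of the argument are not preserved in this text. The usual form of the argument is sketched below. It assumes that |x| = 1 for every example, that negative examples have been negated so every example should satisfy w · x > 0, that w starts at 0, and that a mistake means w · x ≤ 0 (so the first update can fire). The index k, counting mistakes, is notation introduced here.

```latex
% w_k = hypothesis after the k-th mistake, w_0 = 0, update w_k = w_{k-1} + x.
\begin{align*}
w_k \cdot w^* &= w_{k-1} \cdot w^* + x \cdot w^* \;\ge\; w_{k-1} \cdot w^* + \Delta
  &&\Rightarrow\quad w_k \cdot w^* \ge k\,\Delta \\
|w_k|^2 &= |w_{k-1}|^2 + 2\, w_{k-1} \cdot x + |x|^2 \;\le\; |w_{k-1}|^2 + 1
  &&\Rightarrow\quad |w_k|^2 \le k \\
k\,\Delta &\le w_k \cdot w^* \;\le\; |w_k|\,|w^*| = |w_k| \;\le\; \sqrt{k}
  &&\Rightarrow\quad k \le (1/\Delta)^2
\end{align*}
```

The first line uses Δ = min_x { w* · x }, the second uses w_{k-1} · x ≤ 0 on a mistake, and the third combines them via Cauchy-Schwarz.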
#1b: Perceptron Training Rule
For each labeled instance [x, y]:
• Err([x, y]) = y – o_w(x) ∈ { -1, 0, +1 }
• If Err([x, y]) = 0: Correct! … Do nothing! ∆w = 0 ≡ Err([x, y]) · x
• If Err([x, y]) = +1: Mistake on positive! Increment by +x. ∆w = +x ≡ Err([x, y]) · x
• If Err([x, y]) = -1: Mistake on negative! Increment by -x. ∆w = -x ≡ Err([x, y]) · x
In all cases...
    $\Delta w^{(i)} = \mathrm{Err}([x^{(i)}, y^{(i)}]) \cdot x^{(i)} = [\, y^{(i)} - o_w(x^{(i)}) \,] \cdot x^{(i)}$
Batch Mode: do ALL updates at once!
    $\Delta w_j = \sum_i \Delta w_j^{(i)} = \sum_i x_j^{(i)} \, ( y^{(i)} - o_w(x^{(i)}) )$
    $w_j \leftarrow w_j + \eta \, \Delta w_j$

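0. Fix w (on later passes: the new w); set ∆w = 0
1. For each row i, compute
    a. $E^{(i)} = y^{(i)} - o_w(x^{(i)})$
    b. $\Delta w \mathrel{+}= E^{(i)} x^{(i)}$   [ … $\Delta w_j \mathrel{+}= E^{(i)} x_j^{(i)}$ … for each feature j ]
2. Increment w += η ∆w
[Figure: data matrix highlighting row $x^{(i)}$, entry $x_j^{(i)}$, error $E^{(i)}$, and the accumulators $\Delta w_j$ and $\Delta w$]
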
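A minimal Python sketch of these batch steps, assuming 0/1 labels, a 0/1 thresholded output, a bias input x_0 = 1, and a stopping test (no errors in a full pass) that is not on the slide. The AND data set and the value η = 1 are chosen only for illustration.

```python
def output(w, xb):
    """Thresholded unit: 1 if w . xb > 0, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, xb)) > 0 else 0

def batch_perceptron(X, y, eta=1.0, max_epochs=100):
    w = [0.0] * (len(X[0]) + 1)              # w[0] is the bias weight
    for _ in range(max_epochs):
        delta = [0.0] * len(w)               # step 0: reset the accumulator
        n_errors = 0
        for xi, yi in zip(X, y):             # step 1: accumulate over all rows
            xb = [1.0] + list(xi)
            err = yi - output(w, xb)         # E(i) in {-1, 0, +1}
            if err != 0:
                n_errors += 1
            for j, xj in enumerate(xb):
                delta[j] += err * xj
        if n_errors == 0:                    # assumed stopping rule
            break
        w = [wj + eta * dj for wj, dj in zip(w, delta)]   # step 2
    return w

X = [(0, 0), (0, 1), (1, 0), (1, 1)]         # toy data: logical AND
y = [0, 0, 0, 1]
w = batch_perceptron(X, y)
predictions = [output(w, [1.0] + list(xi)) for xi in X]
print(w, predictions)
```

Unlike the online version, w stays fixed while the whole pass is scored, and the summed correction is applied once per pass.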
Correctness
• Rule is intuitive: climbs in correct direction. . .
• Thrm: Converges to correct answer, if . . .
    • training data is linearly separable
    • η is sufficiently small
• Proof: Weight space has EXACTLY 1 minimum! (no non-global minima)
  ⇒ with enough examples, finds correct function!
• Explains early popularity
• If η too large, may overshoot; if η too small, takes too long
• So often η = η(k) … which decays with # of iterations, k

#1c: Matrix Version?

Issues
1. Why restrict to only $y_i \in \{-1, +1\}$?
    • If from discrete set $y_i \in \{0, 1, \ldots, m\}$: General (non-binary) classification
    • If ARBITRARY $y_i \in \Re$: Regression
2. What if NO w works? ... X is singular; overconstrained ...
   Could try to minimize residual:
    • $\sum_i I[\, y^{(i)} \neq w \cdot x^{(i)} \,]$   NP-Hard!
    • $\| y - Xw \|_1 = \sum_i | y^{(i)} - w \cdot x^{(i)} |$
    • $\| y - Xw \|_2^2 = \sum_i ( y^{(i)} - w \cdot x^{(i)} )^2$   Easy!

L2 error vs 0/1-Loss
• "0/1 loss function" is not smooth, not differentiable
• MSE error is smooth, differentiable… and is an upper bound ("overbound")...

Gradient Descent for Perceptron?
• Why not Gradient Descent for THRESHOLDed perceptron?
    • Needs gradient (derivative), not …
• Gradient Descent is a General approach. Requires
    + continuously parameterized hypothesis
    + error must be differentiable wrt parameters
  But. . .
    – can be slow (many iterations)
    – may only find LOCAL opt

#1. LMS version of Classifier
• View as Regression
    • Find "best" linear mapping w from X to Y
    • $w^* = \arg\min_w \mathrm{Err}_{LMS}^{(X,Y)}(w)$
    • $\mathrm{Err}_{LMS}^{(X,Y)}(w) = \sum_i ( y^{(i)} - w \cdot x^{(i)} )^2$
• Threshold: if w · x > 0.5, return 1; else 0
• See Chapter 3

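One way to find such a w is plain gradient descent on Err_LMS, followed by the 0.5 threshold. The sketch below assumes 0/1 labels and a bias input x_0 = 1; the step size, the iteration count, and the toy data are illustrative choices, not values from the slides.

```python
def lms_gradient_descent(X, y, eta=0.02, steps=2000):
    """Minimize sum_i (y_i - w . x_i)^2 by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)                  # w[0] is the bias weight
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            xb = [1.0] + list(xi)
            residual = yi - sum(wj * xj for wj, xj in zip(w, xb))
            for j, xj in enumerate(xb):
                grad[j] += -2.0 * residual * xj  # d/dw_j of (y - w . x)^2
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w

def classify(w, x):
    """Threshold the regression output at 0.5."""
    value = sum(wj * xj for wj, xj in zip(w, [1.0] + list(x)))
    return 1 if value > 0.5 else 0

X = [(0.0,), (1.0,), (2.0,), (3.0,)]             # toy 1-D data
y = [0, 0, 1, 1]
w = lms_gradient_descent(X, y)
predictions = [classify(w, x) for x in X]
print(w, predictions)
```

Because the error is differentiable in w, each step has a well-defined direction, which is what the thresholded perceptron lacks.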
Use Linear Regression for Classification?
1. Use regression to find weights w
2. Classify new instance x as sgn(w · x)
But … regression minimizes the sum of squared errors on the target function
… which gives strong influence to outliers
[Figure: two panels, "Great separation" and "Bad separation"]

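The outlier effect can be seen on a made-up 1-D example. The sketch assumes 0/1 labels and a decision threshold of 0.5 on the fitted line, and uses the closed-form least-squares fit for a single feature; the data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return mean_y - slope * mean_x, slope

def boundary(intercept, slope, threshold=0.5):
    """The x value at which the fitted line crosses the threshold."""
    return (threshold - intercept) / slope

# Negatives at x = 0, 1 and positives at x = 2, 3.
xs, ys = [0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1]
clean_boundary = boundary(*fit_line(xs, ys))

# Add one positive far out on the correct side of any sensible boundary.
outlier_boundary = boundary(*fit_line(xs + [20.0], ys + [1]))

print(clean_boundary, outlier_boundary)
```

Without the outlier the boundary sits midway between the classes. With it, the boundary moves past x = 2, so a positive example is misclassified even though the added point was itself easy to classify.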