
BBM406 Fundamentals of Machine Learning, Lecture 10: Linear Discriminant Functions, Perceptron - PowerPoint PPT Presentation



  1. Illustration: Frank Rosenblatt's Perceptron
  BBM406 Fundamentals of Machine Learning
  Lecture 10: Linear Discriminant Functions, Perceptron
  Aykut Erdem // Hacettepe University // Fall 2019

  2. • Assignment 2 is out!
    − It is due November 22 (i.e. in 2 weeks)
    − Implement a Naive Bayes classifier for fake news detection
  image credit: Frederick Burr Opper

  3. Last time… Logistic Regression
  • Assumes the following functional form for P(Y|X): a logistic function applied to a linear function of the data,
    P(Y = 1 | X) = 1 / (1 + exp(−(w_0 + Σ_i w_i X_i)))
  • The logistic (or sigmoid) function: σ(z) = 1 / (1 + e^(−z))
  • Features can be discrete or continuous!
  slide by Aarti Singh & Barnabás Póczos
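As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of the logistic model above; the weight values in the usage line are made up for the example:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def p_y_given_x(x, w, w0):
    # P(Y = 1 | X = x): sigmoid applied to a linear function of the features
    return sigmoid(np.dot(w, x) + w0)

# Toy usage with made-up weights (illustrative only)
w = np.array([0.5, -1.2])
x = np.array([2.0, 1.0])
print(p_y_given_x(x, w, w0=0.1))  # probability that Y = 1
```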

  4. Last time… Logistic Regression vs. Gaussian Naïve Bayes
  • LR is a linear classifier
    − decision rule is a hyperplane
  • LR is optimized by maximizing the conditional likelihood
    − no closed-form solution
    − concave → global optimum with gradient ascent
  • Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR
    − the solutions differ because of the objective (loss) function
  • In general, NB and LR make different assumptions
    − NB: features independent given the class → assumption on P(X|Y)
    − LR: functional form of P(Y|X), no assumption on P(X|Y)
  • Convergence rates
    − GNB (usually) needs less data
    − LR (usually) gets to better solutions in the limit
  slide by Aarti Singh & Barnabás Póczos
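To make the "concave → global optimum with gradient ascent" point concrete, here is a minimal sketch (not from the slides) of batch gradient ascent on the logistic regression conditional log-likelihood; the learning rate and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient ascent on the conditional log-likelihood.
    X: (n, d) feature matrix, y: (n,) labels in {0, 1}."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + w0)       # predicted P(Y = 1 | X)
        grad_w = X.T @ (y - p)        # gradient of the log-likelihood w.r.t. w
        grad_w0 = np.sum(y - p)       # gradient w.r.t. the bias
        w += lr * grad_w / n
        w0 += lr * grad_w0 / n
    return w, w0
```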

  5. Linear Discriminant Functions

  6. Linear Discriminant Function
  • Linear discriminant function for a vector x:
    y(x) = w^T x + w_0
    where w is called the weight vector and w_0 is a bias.
  • The classification function is
    C(x) = sign(w^T x + w_0)
    where the step function sign(·) is defined as
    sign(a) = +1 if a > 0, −1 if a < 0
  slide by Ce Liu
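A minimal NumPy sketch of this decision rule (illustrative only; the weight vector w and bias w0 are assumed to be given, e.g. by one of the learning methods later in the lecture):

```python
import numpy as np

def linear_discriminant(x, w, w0):
    # y(x) = w^T x + w0
    return np.dot(w, x) + w0

def classify(x, w, w0):
    # C(x) = sign(w^T x + w0), with labels in {-1, +1}
    return 1 if linear_discriminant(x, w, w0) > 0 else -1
```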

  7. Properties of Linear Discriminant Functions
  • The decision surface y(x) = 0 (shown in red in the figure) is perpendicular to w, and its displacement from the origin is controlled by the bias parameter w_0.
  • The signed orthogonal distance of a general point x from the decision surface is y(x) / ||w||, so y(x) gives a signed measure of the perpendicular distance r of x from the decision surface.
  • y(x) = 0 for x on the decision surface. The normal distance from the origin to the decision surface is
    w^T x / ||w|| = −w_0 / ||w||
    so w_0 determines the location of the decision surface.
  [Figure: decision surface y = 0 separating regions R_1 (y > 0) and R_2 (y < 0), with w normal to the surface]
  slide by Ce Liu
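A small sketch (not from the slides) of the two quantities above: the signed distance of a point from the surface, y(x)/||w||, and the surface's normal distance from the origin, −w_0/||w||:

```python
import numpy as np

def signed_distance(x, w, w0):
    # Signed orthogonal distance of x from the surface w^T x + w0 = 0
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

def origin_offset(w, w0):
    # Normal distance from the origin to the decision surface: -w0 / ||w||
    return -w0 / np.linalg.norm(w)
```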

  8. Properties of Linear Discriminant Functions (cont'd)
  • Decompose any point x as
    x = x_⊥ + r · w / ||w||
    where x_⊥ is the projection of x onto the decision surface. Then
    w^T x = w^T x_⊥ + r · w^T w / ||w|| = w^T x_⊥ + r ||w||
    w^T x + w_0 = w^T x_⊥ + w_0 + r ||w||
    and since w^T x_⊥ + w_0 = 0 (x_⊥ lies on the decision surface),
    y(x) = r ||w||,  i.e.  r = y(x) / ||w||
  • Simpler notation: define the augmented vectors w̃ = (w_0, w) and x̃ = (1, x), so that y(x) = w̃^T x̃.
  slide by Ce Liu
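A minimal sketch of the augmented-vector notation, which folds the bias into the weight vector (illustrative only):

```python
import numpy as np

def augment(x):
    # x_tilde = (1, x): prepend a constant 1 so the bias becomes an ordinary weight
    return np.concatenate(([1.0], x))

def y_augmented(x, w, w0):
    # y(x) = w_tilde^T x_tilde, with w_tilde = (w0, w)
    w_tilde = np.concatenate(([w0], w))
    return np.dot(w_tilde, augment(x))
```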

  9. Multiple Classes: Simple Extension
  • One-versus-the-rest classifier: for each class C_k, separate points in C_k from points not in C_k.
  • One-versus-one classifier: classify every pair of classes.
  • Both simple extensions leave ambiguously classified regions of input space (marked "?" in the figure).
  [Figure: ambiguous regions for one-versus-the-rest (left) and one-versus-one (right)]
  slide by Ce Liu

  10. Multiple Classes: K-Class Discriminant
  • A single K-class discriminant comprising K linear functions:
    y_k(x) = w_k^T x + w_{k0}
  • Decision function:
    C(x) = k  if  y_k(x) > y_j(x) for all j ≠ k
  • The decision boundary between classes C_k and C_j is given by y_k(x) = y_j(x), i.e.
    (w_k − w_j)^T x + (w_{k0} − w_{j0}) = 0
  slide by Ce Liu
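A minimal sketch of the K-class decision rule (illustrative; W is an assumed K×d matrix stacking the weight vectors w_k, and w0 a length-K vector of biases w_{k0}):

```python
import numpy as np

def k_class_predict(x, W, w0):
    """W: (K, d) matrix of weight vectors, w0: (K,) biases.
    Returns the index k maximizing y_k(x) = w_k^T x + w_{k0}."""
    scores = W @ x + w0
    return int(np.argmax(scores))
```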

  11. Fisher's Linear Discriminant
  • A way to view a linear classification model is in terms of dimensionality reduction: pursue the optimal linear projection y = w^T x on which the two classes are maximally separated.
  • The mean vectors of the two classes are
    m_1 = (1/N_1) Σ_{n∈C_1} x_n,   m_2 = (1/N_2) Σ_{n∈C_2} x_n
  [Figure: projection onto the difference of means (left) vs. Fisher's Linear Discriminant (right)]
  slide by Ce Liu

  12. What's a Good Projection?
  • After projection, the two classes should be separated as much as possible, measured by the distance between the projected centers:
    (w^T (m_1 − m_2))^2 = w^T (m_1 − m_2)(m_1 − m_2)^T w = w^T S_B w
    where S_B = (m_1 − m_2)(m_1 − m_2)^T is called the between-class covariance matrix.
  • After projection, the variances of the two classes should be as small as possible, measured by the within-class covariance w^T S_W w, where
    S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T
  slide by Ce Liu
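A minimal NumPy sketch (not from the slides) computing the class means and the two scatter matrices defined above; X1 and X2 are assumed to hold the samples of the two classes row-wise:

```python
import numpy as np

def scatter_matrices(X1, X2):
    """X1: (N1, d) samples of class 1, X2: (N2, d) samples of class 2."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    diff = (m1 - m2).reshape(-1, 1)
    S_B = diff @ diff.T                                        # between-class scatter
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)    # within-class scatter
    return m1, m2, S_B, S_W
```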

  13. Fisher's Linear Discriminant
  • Fisher criterion: maximize w.r.t. w the ratio of between-class variance to within-class variance,
    J(w) = (w^T S_B w) / (w^T S_W w)
  • Recall the quotient rule: for f(x) = g(x) / h(x),
    f′(x) = (g′(x) h(x) − g(x) h′(x)) / h(x)^2
  • Setting ∇J(w) = 0, we obtain
    (w^T S_B w) S_W w = (w^T S_W w) S_B w
    (w^T S_B w) S_W w = (w^T S_W w) (m_2 − m_1) ((m_2 − m_1)^T w)
  • The terms w^T S_B w, w^T S_W w and (m_2 − m_1)^T w are scalars, and we only care about the direction of w, so the scalars can be dropped. Therefore
    w ∝ S_W^{−1} (m_2 − m_1)
  slide by Ce Liu
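Putting the result together, a minimal sketch (illustrative only) of the Fisher direction w ∝ S_W^{-1}(m_2 − m_1), recomputing the within-class scatter so the function is self-contained:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return the (normalized) Fisher discriminant direction w ∝ S_W^{-1}(m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # avoids explicitly inverting S_W
    return w / np.linalg.norm(w)
```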

  14. From Fisher's Linear Discriminant to Classifiers
  • Fisher's Linear Discriminant is not a classifier by itself; it only determines an optimal projection that converts a high-dimensional classification problem into a 1D one.
  • A bias (threshold) is needed to form a linear classifier (multiple thresholds lead to nonlinear classifiers). The final classifier has the form
    y(x) = sign(w^T x + w_0)
    where the nonlinear activation function sign(·) is a step function:
    sign(a) = +1 if a > 0, −1 if a < 0
  • How to decide the bias w_0?
  slide by Ce Liu
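One common heuristic, given here only as an assumption for illustration (the slides leave the question open), is to place the threshold halfway between the projected class means:

```python
import numpy as np

def midpoint_bias(X1, X2, w):
    # Threshold halfway between the projected class means,
    # so that w^T x + w0 > 0 roughly corresponds to class 2
    m1_proj = X1.mean(axis=0) @ w
    m2_proj = X2.mean(axis=0) @ w
    return -0.5 * (m1_proj + m2_proj)
```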

  15. Perceptron

  16. Early theories of the brain
  slide by Alex Smola

  17. Biology and Learning
  • Basic idea
    − Good behavior should be rewarded, bad behavior punished (or not rewarded). This improves system fitness.
    − Killing a sabertooth tiger should be rewarded …
    − Correlated events should be combined.
    − Pavlov's salivating dog.
  • Training mechanisms
    − Behavioral modification of individuals (learning): successful behavior is rewarded (e.g. food).
    − Hard-coded behavior in the genes (instinct): the wrongly coded animal does not reproduce.
  slide by Alex Smola

  18. Neurons
  • Soma (CPU): cell body, combines signals
  • Dendrite (input bus): combines the inputs from several other nerve cells
  • Synapse (interface): interface and parameter store between neurons
  • Axon (cable): may be up to 1 m long and transports the activation signal to neurons at different locations
  slide by Alex Smola

  19. Neurons
  [Figure: inputs x_1, x_2, x_3, …, x_n, weighted by synaptic weights w_1, …, w_n and summed into the output]
    f(x) = Σ_i w_i x_i = ⟨w, x⟩
  slide by Alex Smola

  20. Perceptron
  • Weighted linear combination of the inputs x_1, …, x_n with synaptic weights w_1, …, w_n
  • Nonlinear decision function
  • Linear offset (bias):
    f(x) = σ(⟨w, x⟩ + b)
  • Linear separating hyperplanes (spam/ham, novel/typical, click/no click)
  • Learning: estimating the parameters w and b
  slide by Alex Smola

  21. Perceptron
  [Figure: a linear separator between "Ham" and "Spam" examples]
  slide by Alex Smola

  22. Perceptron
  [Photos: Widrow and Rosenblatt]
  slide by Alex Smola

  23. The Perceptron
  Algorithm:
    initialize w = 0 and b = 0
    repeat
      if y_i [⟨w, x_i⟩ + b] ≤ 0 then
        w ← w + y_i x_i and b ← b + y_i
      end if
    until all classified correctly
  • Nothing happens if an example is classified correctly.
  • The weight vector is a linear combination of the stored examples: w = Σ_{i∈I} y_i x_i
  • The classifier is a linear combination of inner products: f(x) = Σ_{i∈I} y_i ⟨x_i, x⟩ + b
  slide by Alex Smola
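A runnable NumPy sketch of this training loop (illustrative; labels are assumed to be in {−1, +1}, and a maximum number of epochs is added so the loop terminates even if the data are not linearly separable):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """X: (n, d) data matrix, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        errors = 0
        for i in range(n):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:   # misclassified (or on the boundary)
                w += y[i] * X[i]                    # w <- w + y_i x_i
                b += y[i]                           # b <- b + y_i
                errors += 1
        if errors == 0:                             # all classified correctly
            break
    return w, b
```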

  24. Convergence Theorem
  • If there exists some (w*, b*) with ||w*|| = 1 such that
    y_i [⟨x_i, w*⟩ + b*] ≥ ρ for all i,
    then the perceptron converges to a linear separator after a number of steps bounded by
    ((b*)^2 + 1)(r^2 + 1) ρ^{−2},  where ||x_i|| ≤ r.
  • Dimensionality independent
  • Order independent (i.e. also worst case)
  • Scales with the 'difficulty' of the problem
  slide by Alex Smola
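A tiny sketch (not from the slides) that evaluates this mistake bound for a given unit-length separator (w_star, b_star) and dataset, just to make the quantities r and ρ concrete:

```python
import numpy as np

def perceptron_mistake_bound(X, y, w_star, b_star):
    """Evaluate ((b*)^2 + 1)(r^2 + 1) / rho^2 for a unit-length separator w_star."""
    r = np.max(np.linalg.norm(X, axis=1))        # radius of the data: ||x_i|| <= r
    rho = np.min(y * (X @ w_star + b_star))      # margin achieved by (w*, b*)
    return (b_star**2 + 1) * (r**2 + 1) / rho**2
```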

  25. Consequences
  • Only need to store the errors: this gives a compression bound for the perceptron.
  • The update is stochastic gradient descent on the hinge loss
    l(x_i, y_i, w, b) = max(0, 1 − y_i [⟨w, x_i⟩ + b])
  • Fails with noisy data, so do NOT train your avatar with perceptrons.
  [Image: the game Black & White]
  slide by Alex Smola
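A minimal sketch (illustrative only) of one stochastic subgradient step on the hinge loss stated above; the learning rate eta is an assumed parameter:

```python
import numpy as np

def hinge_loss(x_i, y_i, w, b):
    return max(0.0, 1.0 - y_i * (np.dot(w, x_i) + b))

def sgd_step(x_i, y_i, w, b, eta=1.0):
    # The loss has a nonzero subgradient only when the margin is violated
    if y_i * (np.dot(w, x_i) + b) < 1.0:
        w = w + eta * y_i * x_i
        b = b + eta * y_i
    return w, b
```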

  26. Hardness: margin vs. size
  [Figure: a small-margin dataset (hard) vs. a large-margin dataset (easy)]
  slide by Alex Smola

  27.–38. [Figure-only slides; no recoverable text]
  slides by Alex Smola
