Illustration: Frank Rosenblatt's Perceptron
BBM406 Fundamentals of Machine Learning
Lecture 10: Linear Discriminant Functions & Perceptron
Aykut Erdem // Hacettepe University // Fall 2019
• Assignment 2 is out!
− It is due November 22 (i.e. in 2 weeks)
− Implement a Naive Bayes classifier for fake news detection
image credit: Frederick Burr Opper
Last time… Logistic Regression
• Assumes the following functional form for P(Y|X): the logistic function applied to a linear function of the data
• Logistic function (or Sigmoid): σ(z) = 1 / (1 + e^{−z})
• Features can be discrete or continuous!
slide by Aarti Singh & Barnabás Póczos
Last time… Logistic Regression vs. Gaussian Naïve Bayes
• LR is a linear classifier
− decision rule is a hyperplane
• LR optimized by maximizing conditional likelihood
− no closed-form solution
− concave → global optimum with gradient ascent
• Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR
− solutions differ because of the objective (loss) function
• In general, NB and LR make different assumptions
− NB: features independent given class → assumption on P(X|Y)
− LR: functional form of P(Y|X), no assumption on P(X|Y)
• Convergence rates
− GNB (usually) needs less data
− LR (usually) gets to better solutions in the limit
slide by Aarti Singh & Barnabás Póczos
Linear Discriminant Functions
Linear Discriminant Function
• Linear discriminant function for a vector x:
y(x) = w^T x + w_0
where w is called the weight vector and w_0 is a bias.
• The classification function is
C(x) = sign(w^T x + w_0)
where the step function sign(·) is defined as
sign(a) = +1 if a > 0, −1 if a < 0
slide by Ce Liu
Properties of Linear Discriminant Functions
• The decision surface y(x) = 0 (shown in red in the figure) is perpendicular to w, and its displacement from the origin is controlled by the bias parameter w_0.
• The signed orthogonal distance of a general point x from the decision surface is y(x)/||w||, i.e. y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface.
• y(x) = 0 for x on the decision surface. The normal distance from the origin to the decision surface is
w^T x / ||w|| = −w_0 / ||w||
• So w_0 determines the location of the decision surface.
slide by Ce Liu
Properties of Linear Discriminant Functions (cont'd)
• Let
x = x_⊥ + r · w/||w||
where x_⊥ is the projection of x onto the decision surface. Then
w^T x = w^T x_⊥ + r · w^T w / ||w||
w^T x + w_0 = w^T x_⊥ + w_0 + r ||w||
y(x) = r ||w||, so r = y(x)/||w||
• Simpler notation: define w̃ = (w_0, w) and x̃ = (1, x), so that y(x) = w̃^T x̃
slide by Ce Liu
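A minimal numeric sketch of these quantities; the weight vector, bias, and query point below are made-up illustrative values, not from the slides:

```python
import numpy as np

# Hypothetical 2D example: weight vector, bias, and a query point
w = np.array([2.0, -1.0])
w0 = 0.5
x = np.array([1.0, 3.0])

y_x = w @ x + w0                       # discriminant value y(x)
r = y_x / np.linalg.norm(w)            # signed distance of x from the surface
origin_dist = -w0 / np.linalg.norm(w)  # normal distance from origin to surface

# Augmented ("tilde") notation: fold the bias into the weight vector
w_tilde = np.concatenate(([w0], w))    # (w0, w)
x_tilde = np.concatenate(([1.0], x))   # (1, x)
assert np.isclose(w_tilde @ x_tilde, y_x)

print(y_x, r, origin_dist)
```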
Multiple Classes: Simple Extension
• One-versus-the-rest classifier: classify C_k vs. samples not in C_k.
• One-versus-one classifier: classify every pair of classes.
• Both constructions leave ambiguous regions (marked "?" in the figure) where the class assignment is not well defined.
slide by Ce Liu
Multiple Classes: K-Class Discriminant
• A single K-class discriminant comprising K linear functions
y_k(x) = w_k^T x + w_{k0}
• Decision function
C(x) = k, if y_k(x) > y_j(x) for all j ≠ k
• The decision boundary between classes C_k and C_j is given by y_k(x) = y_j(x), i.e.
(w_k − w_j)^T x + (w_{k0} − w_{j0}) = 0
slide by Ce Liu
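A small sketch of this argmax decision rule; the weight matrix W (one row per class), bias vector w0, and data points are purely illustrative assumptions:

```python
import numpy as np

def k_class_predict(X, W, w0):
    """Assign each row of X to the class with the largest linear score.

    X:  (N, D) data matrix
    W:  (K, D) weight matrix, one weight vector w_k per class
    w0: (K,)   bias terms w_{k0}
    """
    scores = X @ W.T + w0   # y_k(x) = w_k^T x + w_{k0}, shape (N, K)
    return np.argmax(scores, axis=1)

# Toy usage with made-up parameters for K = 3 classes in D = 2 dimensions
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.1, -0.2])
X = np.array([[2.0, 0.5], [0.1, 3.0], [-1.5, -2.0]])
print(k_class_predict(X, W, w0))
```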
Fisher’s Linear Discriminant
• Pursue the optimal linear projection y = w^T x on which the two classes can be maximally separated. A way to view a linear classification model is in terms of dimensionality reduction.
• The mean vectors of the two classes:
m_1 = (1/N_1) Σ_{n ∈ C_1} x_n,   m_2 = (1/N_2) Σ_{n ∈ C_2} x_n
(Figure: projection onto the difference of means vs. Fisher’s linear discriminant.)
slide by Ce Liu
What’s a Good Projection?
• After projection, the two classes are separated as much as possible, measured by the distance between the projected class means:
(w^T(m_1 − m_2))^2 = w^T(m_1 − m_2)(m_1 − m_2)^T w = w^T S_B w
where S_B = (m_1 − m_2)(m_1 − m_2)^T is called the between-class covariance matrix.
• After projection, the variances of the two classes are as small as possible, measured by the within-class covariance w^T S_W w, where
S_W = Σ_{n ∈ C_1}(x_n − m_1)(x_n − m_1)^T + Σ_{n ∈ C_2}(x_n − m_2)(x_n − m_2)^T
slide by Ce Liu
Fisher’s Linear Discriminant
• Fisher criterion: maximize the ratio of between-class variance to within-class variance w.r.t. w:
J(w) = (w^T S_B w) / (w^T S_W w)
• Recall the quotient rule: for f(x) = g(x)/h(x),
f'(x) = (g'(x) h(x) − g(x) h'(x)) / h^2(x)
• Setting ∇J(w) = 0, we obtain
(w^T S_B w) S_W w = (w^T S_W w) S_B w
(w^T S_B w) S_W w = (w^T S_W w)(m_2 − m_1)((m_2 − m_1)^T w)
• The terms w^T S_B w, w^T S_W w and (m_2 − m_1)^T w are scalars, and we only care about directions, so the scalars can be dropped. Therefore
w ∝ S_W^{-1}(m_2 − m_1)
slide by Ce Liu
From Fisher’s Linear Discriminant to Classifiers
• Fisher’s Linear Discriminant is not a classifier by itself; it only determines an optimal projection that converts a high-dimensional classification problem into a 1D one.
• A bias (threshold) is needed to form a linear classifier (multiple thresholds lead to nonlinear classifiers). The final classifier has the form
y(x) = sign(w^T x + w_0)
where the nonlinear activation function sign(·) is a step function:
sign(a) = +1 if a > 0, −1 if a < 0
• How to decide the bias w_0? (One common choice is sketched below.)
slide by Ce Liu
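A minimal sketch of the full pipeline, assuming two labelled point clouds X1 and X2 (made-up data). Here the threshold is simply placed halfway between the projected class means, one common but not unique way to set w_0:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return the Fisher direction w ∝ S_W^{-1} (m2 - m1), normalized."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)        # S_W^{-1} (m2 - m1)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))   # class -1
X2 = rng.normal([4, 2], 1.0, size=(50, 2))   # class +1

w = fisher_direction(X1, X2)
# Place the bias so the threshold sits midway between the projected means
w0 = -0.5 * (X1.mean(axis=0) + X2.mean(axis=0)) @ w

predict = lambda X: np.sign(X @ w + w0)      # y(x) = sign(w^T x + w0)
print((predict(X1) == -1).mean(), (predict(X2) == +1).mean())
```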
Perceptron
Early theories of the brain
slide by Alex Smola
Biology and Learning
• Basic Idea
− Good behavior should be rewarded, bad behavior punished (or not rewarded). This improves system fitness.
− Killing a sabertooth tiger should be rewarded ...
− Correlated events should be combined.
− Pavlov’s salivating dog.
• Training mechanisms
− Behavioral modification of individuals (learning): successful behavior is rewarded (e.g. food).
− Hard-coded behavior in the genes (instinct): the wrongly coded animal does not reproduce.
slide by Alex Smola
Neurons
• Soma (CPU): cell body, combines signals
• Dendrite (input bus): combines the inputs from several other nerve cells
• Synapse (interface): interface and parameter store between neurons
• Axon (cable): may be up to 1m long and transports the activation signal to neurons at different locations
slide by Alex Smola
Neurons
• Inputs x_1, x_2, x_3, ..., x_n are scaled by synaptic weights w_1, ..., w_n and summed to produce the output
f(x) = Σ_i w_i x_i = ⟨w, x⟩
slide by Alex Smola
Perceptron
• Weighted linear combination of inputs x_1, ..., x_n with synaptic weights w_1, ..., w_n
• Nonlinear decision function
• Linear offset (bias)
f(x) = σ(⟨w, x⟩ + b)
• Linear separating hyperplanes (spam/ham, novel/typical, click/no click)
• Learning: estimating the parameters w and b
slide by Alex Smola
Perceptron
(Figure: a linear separator between Ham and Spam.)
slide by Alex Smola
Perceptron
(Photos: Widrow and Rosenblatt.)
slide by Alex Smola
The Perceptron
initialize w = 0 and b = 0
repeat
  if y_i [⟨w, x_i⟩ + b] ≤ 0 then
    w ← w + y_i x_i and b ← b + y_i
  end if
until all classified correctly
• Nothing happens if a point is classified correctly
• The weight vector is a linear combination w = Σ_{i ∈ I} y_i x_i (over the set I of points that triggered an update)
• The classifier is a linear combination of inner products
f(x) = Σ_{i ∈ I} y_i ⟨x_i, x⟩ + b
slide by Alex Smola
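A straightforward sketch of this update loop in Python; the toy data and the epoch cap are illustrative assumptions, and labels are assumed to be in {−1, +1}:

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron: sweep the data, updating on every misclassified point."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:   # misclassified (or on the boundary)
                w += y_i * x_i             # w <- w + y_i x_i
                b += y_i                   # b <- b + y_i
                mistakes += 1
        if mistakes == 0:                  # all classified correctly
            break
    return w, b

# Toy linearly separable data, labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))  # should match y
```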
Convergence Theorem
• If there exists some (w*, b*) with unit length such that
y_i [⟨x_i, w*⟩ + b*] ≥ ρ for all i,
then the perceptron converges to a linear separator after a number of steps bounded by
(b*^2 + 1)(r^2 + 1) ρ^{−2}, where ||x_i|| ≤ r
• Dimensionality independent
• Order independent (i.e. also worst case)
• Scales with the ‘difficulty’ of the problem
slide by Alex Smola
Consequences
• Only need to store errors. This gives a compression bound for the perceptron.
• Stochastic gradient descent on the hinge loss
l(x_i, y_i, w, b) = max(0, 1 − y_i [⟨w, x_i⟩ + b])
• Fails with noisy data
(do NOT train your avatar with perceptrons; Black & White)
slide by Alex Smola
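For contrast, a brief sketch of plain SGD on this hinge loss; the learning rate and epoch count are arbitrary choices for illustration, not from the slides:

```python
import numpy as np

def hinge_sgd(X, y, lr=0.1, epochs=50):
    """SGD on l(x, y, w, b) = max(0, 1 - y * (<w, x> + b))."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) < 1:    # inside the margin: nonzero (sub)gradient
                w += lr * y_i * x_i
                b += lr * y_i
    return w, b
```

Unlike the perceptron, this rule also updates on points that are classified correctly but fall inside the unit margin.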
Hardness: margin vs. size
(Figure: small margin = hard, large margin = easy.)
slide by Alex Smola