CS 4100: Artificial Intelligence
Perceptrons and Logistic Regression
Jan-Willem van de Meent, Northeastern University
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Linear Classifiers
Feature Vectors

Example (spam filtering): the email
  "Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just ..."
maps to a feature vector such as
  # free      : 2
  YOUR_NAME   : 0
  MISSPELLED  : 2
  FROM_FRIEND : 0
  ...
and is labeled SPAM. Likewise, an image of a digit maps to pixel and shape features
  PIXEL-7,12 : 1
  PIXEL-7,13 : 0
  ...
  NUM_LOOPS  : 1
  ...
and is labeled "2".

Some (Simplified) Biology
• Very loose inspiration: human neurons
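To make this concrete, here is a minimal sketch of how an email might be turned into such a sparse feature vector. The function name and the tokenization are mine, and the misspelling list is a stand-in; features like YOUR_NAME and FROM_FRIEND would come from message metadata in a real system.

```python
def spam_features(email_text, known_misspellings={"printr", "cartriges"}):
    """Map an email to a sparse feature vector (a dict of feature name -> value)."""
    words = email_text.lower().split()  # deliberately naive tokenization
    return {
        "# free": float(words.count("free")),
        "MISSPELLED": float(sum(1 for w in words if w in known_misspellings)),
        "YOUR_NAME": 0.0,    # would come from metadata; stubbed here
        "FROM_FRIEND": 0.0,  # would come from metadata; stubbed here
    }
```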
Linear Classifiers
• Inputs are feature values
• Each feature has a weight
• Sum is the activation:

  activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)

• If the activation is:
  • Positive, output +1
  • Negative, output -1

[Diagram: features f_1, f_2, f_3 feed a weighted sum Σ with weights w_1, w_2, w_3, followed by a >0? threshold.]

• Binary case: compare features to a weight vector
• Learning: figure out the weight vector from examples

Example weight vector and feature vectors (a positive dot product means the positive class):
  weights      # free : 4   YOUR_NAME : -1   MISSPELLED : 1   FROM_FRIEND : -3   ...
  spam email   # free : 2   YOUR_NAME :  0   MISSPELLED : 2   FROM_FRIEND :  0   ...
  ham email    # free : 0   YOUR_NAME :  1   MISSPELLED : 1   FROM_FRIEND :  1   ...
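In code, the activation is just a sparse dot product. A minimal sketch over feature dicts (the helper name `activation` is mine):

```python
def activation(weights, features):
    """Compute w . f(x) over sparse feature dicts; absent features count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())
```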
Decision Rules

Binary Decision Rule
• In the space of feature vectors:
  • Examples are points
  • Any weight vector is a hyperplane
  • One side corresponds to y = +1
  • The other side corresponds to y = -1

[Figure: 2D feature space with axes "free" and "money"; the weight vector
  BIAS  : -3
  free  : 4
  money : 2
defines a line separating the +1 = SPAM region from the -1 = HAM region.]
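A small self-contained sketch of the decision rule, using the weight vector from the figure above; `classify_binary` and the example feature vector are mine:

```python
def classify_binary(weights, features):
    """Return +1 (SPAM) if w . f(x) > 0, else -1 (HAM)."""
    score = sum(weights.get(k, 0.0) * v for k, v in features.items())
    return +1 if score > 0 else -1

w = {"BIAS": -3.0, "free": 4.0, "money": 2.0}  # weight vector from the figure
x = {"BIAS": 1.0, "free": 1.0, "money": 1.0}   # a hypothetical email's features
print(classify_binary(w, x))  # activation = -3 + 4 + 2 = 3 > 0, so +1 = SPAM
```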
Weight Updates
Learning: Binary Perceptron
• Start with weights w = 0
• For each training instance (f(x), y*):
  • Classify with current weights
  • If correct (i.e., y = y*): no change!
  • If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1:

    w ← w + y* · f(x)

Examples: Perceptron
• Separable Case
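A minimal sketch of this training loop over sparse feature dicts; the `num_passes` parameter is my addition (the slide simply iterates over training instances):

```python
def train_binary_perceptron(examples, num_passes=10):
    """Binary perceptron. examples: list of (features, y_star), y_star in {+1, -1}."""
    weights = {}
    for _ in range(num_passes):
        for features, y_star in examples:
            score = sum(weights.get(k, 0.0) * v for k, v in features.items())
            y = +1 if score > 0 else -1      # classify with current weights
            if y != y_star:                  # wrong: w <- w + y* . f(x)
                for k, v in features.items():
                    weights[k] = weights.get(k, 0.0) + y_star * v
    return weights
```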
Multiclass Decision Rule
• If we have multiple classes:
  • A weight vector w_y for each class y
  • Score (activation) of a class y:  w_y · f(x)
  • Prediction: the class with the highest score wins:  y = argmax_y w_y · f(x)
• Binary = multiclass where the negative class has weight zero

Learning: Multiclass Perceptron
• Start with all weights = 0
• Pick training examples one by one
• Predict with current weights
• If correct: no change!
• If wrong: lower the score of the wrong answer, raise the score of the right answer:

    w_y  ← w_y  - f(x)    (wrong answer)
    w_y* ← w_y* + f(x)    (right answer)
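The same loop with one weight vector per class; a sketch, with `num_passes` again my addition:

```python
def train_multiclass_perceptron(examples, classes, num_passes=10):
    """Multiclass perceptron. examples: list of (features, y_star).
    Keeps one sparse weight dict per class."""
    weights = {c: {} for c in classes}

    def score(c, features):
        return sum(weights[c].get(k, 0.0) * v for k, v in features.items())

    for _ in range(num_passes):
        for features, y_star in examples:
            y = max(classes, key=lambda c: score(c, features))  # highest score wins
            if y != y_star:
                for k, v in features.items():
                    weights[y][k] = weights[y].get(k, 0.0) - v            # lower wrong answer
                    weights[y_star][k] = weights[y_star].get(k, 0.0) + v  # raise right answer
    return weights
```

Note that the worked example that follows starts from a w_sports with BIAS = 1 rather than all zeros, so reproducing its exact trace requires that initialization.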
Example: Multiclass Perceptron
Question: What will the weights w be for each class after 3 updates?
  y_1 = "politics", x_1 = "win the vote"
  y_2 = "politics", x_2 = "win the election"
  y_3 = "sports",   x_3 = "win the game"

Initial weights:
           w_sports   w_politics   w_tech
  BIAS   :    1           0           0
  win    :    0           0           0
  game   :    0           0           0
  vote   :    0           0           0
  the    :    0           0           0
  ...

Step 1: x_1 = "win the vote", f(x_1) = {BIAS: 1, win: 1, vote: 1, the: 1}
  Scores: w_sports · f(x_1) = 1, w_politics · f(x_1) = 0, w_tech · f(x_1) = 0
  Prediction: "sports" (wrong; y_1 = "politics")
  Update: subtract f(x_1) from w_sports, add f(x_1) to w_politics
Step 2: x_2 = "win the election", f(x_2) = {BIAS: 1, win: 1, the: 1}
  Weights after step 1:
           w_sports   w_politics   w_tech
  BIAS   :    0           1           0
  win    :   -1           1           0
  game   :    0           0           0
  vote   :   -1           1           0
  the    :   -1           1           0
  Scores: w_sports · f(x_2) = -2, w_politics · f(x_2) = 3, w_tech · f(x_2) = 0
  Prediction: "politics" (correct; no change)

Step 3: x_3 = "win the game", f(x_3) = {BIAS: 1, win: 1, game: 1, the: 1}
  Scores: w_sports · f(x_3) = -2, w_politics · f(x_3) = 3, w_tech · f(x_3) = 0
  Prediction: "politics" (wrong; y_3 = "sports")
  Update: add f(x_3) to w_sports, subtract f(x_3) from w_politics
Answer (weights after 3 updates):
           w_sports   w_politics   w_tech
  BIAS   :    1           0           0
  win    :    0           0           0
  game   :    1          -1           0
  vote   :   -1           1           0
  the    :    0           0           0
  ...

Properties of Perceptrons
• Separability: true if there exist weights w that get the training set perfectly correct
• Convergence: if the training data are separable, a perceptron will eventually converge (binary case)
• Mistake Bound: the maximum number of mistakes (updates, binary case) is related to the number of features k and the margin δ (degree of separability):

    mistakes < k / δ²

[Figures: a separable point set with margin δ, and a non-separable point set.]
Problems with the Perceptron
• Noise: if the data isn't separable, the weights might thrash
  • Averaging weight vectors over time can help (averaged perceptron; see the sketch below)
• Mediocre generalization: finds a "barely" separating solution
• Overtraining: test / held-out accuracy usually rises, then falls
  • Overtraining is a kind of overfitting

Improving the Perceptron
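As a concrete instance of the averaging fix mentioned above, here is a minimal sketch of a binary averaged perceptron: it trains as usual but returns the average of the weight vector over all steps rather than the final (possibly thrashing) one. The names and the `num_passes` parameter are mine:

```python
def train_averaged_perceptron(examples, num_passes=10):
    """Binary averaged perceptron over sparse feature dicts."""
    weights, totals, steps = {}, {}, 0
    for _ in range(num_passes):
        for features, y_star in examples:
            score = sum(weights.get(k, 0.0) * v for k, v in features.items())
            if (+1 if score > 0 else -1) != y_star:  # mistake: w <- w + y* . f(x)
                for k, v in features.items():
                    weights[k] = weights.get(k, 0.0) + y_star * v
            for k, v in weights.items():             # accumulate running sum of w
                totals[k] = totals.get(k, 0.0) + v
            steps += 1
    return {k: total / steps for k, total in totals.items()}
```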
Non-Separable Case: Deterministic Decision
• Even the best linear boundary makes at least one mistake

Non-Separable Case: Probabilistic Decision
[Figure: the same non-separable data with a probabilistic boundary; contour labels show P(+1) | P(-1) values of 0.9 | 0.1, 0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7, and 0.1 | 0.9 across the boundary.]
How to get probabilistic decisions?
• Perceptron scoring:  z = w · f(x)
• If z = w · f(x) is very positive → want probability going to 1
• If z = w · f(x) is very negative → want probability going to 0
• Sigmoid function:

  φ(z) = 1 / (1 + e^{-z})

Best w?
• Maximum likelihood estimation:

  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

  with:
  P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^{-w · f(x^(i))})
  P(y^(i) = -1 | x^(i); w) = 1 - 1 / (1 + e^{-w · f(x^(i))})

• This is called Logistic Regression
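A minimal NumPy sketch of the sigmoid and this binary log-likelihood; it relies on the identity P(y | x; w) = φ(y · (w · f(x))), which follows from the two cases above since 1 - φ(z) = φ(-z). Function names are mine:

```python
import numpy as np

def sigmoid(z):
    """phi(z) = 1 / (1 + e^{-z})"""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """ll(w) = sum_i log P(y_i | x_i; w) for binary logistic regression.
    X: (n, d) array of feature vectors, y: (n,) labels in {+1, -1}, w: (d,)."""
    return np.sum(np.log(sigmoid(y * (X @ w))))
```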
Separable Case: Deterministic Decision – Many Options
[Figure: a separable point set with several different separating lines.]

Separable Case: Probabilistic Decision – Clear Preference
[Figure: the same data with probability contours (0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7) for two candidate boundaries; the larger-margin boundary is clearly preferred.]
Multiclass Logistic Regression
• Recall the multiclass perceptron:
  • A weight vector w_y for each class y
  • Score (activation) of a class y:  z_y = w_y · f(x)
  • Prediction: the class with the highest score wins
• How to turn scores into probabilities?

  z_1, z_2, z_3  →  e^{z_1} / (e^{z_1} + e^{z_2} + e^{z_3}),
                    e^{z_2} / (e^{z_1} + e^{z_2} + e^{z_3}),
                    e^{z_3} / (e^{z_1} + e^{z_2} + e^{z_3})

  original activations → softmax activations

Best w?
• Maximum likelihood estimation:

  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

  with:
  P(y^(i) | x^(i); w) = e^{w_y^(i) · f(x^(i))} / Σ_y e^{w_y · f(x^(i))}

• This is called Multi-Class Logistic Regression
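A matching NumPy sketch of the softmax and the multiclass log-likelihood; the max-shift is a standard numerical-stability trick, and the array shapes are my assumptions:

```python
import numpy as np

def softmax(z):
    """Map a vector of activations z to probabilities that sum to 1."""
    exp_z = np.exp(z - np.max(z))  # subtract max(z) to avoid overflow
    return exp_z / np.sum(exp_z)

def multiclass_log_likelihood(W, X, y):
    """ll(w) = sum_i log P(y_i | x_i; w) with softmax probabilities.
    W: (k, d) array, one weight row per class; X: (n, d); y: (n,) class indices."""
    scores = X @ W.T                                      # (n, k) activations w_y . f(x)
    shifted = scores - scores.max(axis=1, keepdims=True)  # stabilize the exponentials
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(y)), y].sum()
```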
Next Lecture
• Optimization
• i.e., how do we solve:

  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)